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Abstract 

We introduce a framework to build a survival/risk bump hunting model with a censored time-to-event 
response. Our Survival Bump Hunting (SBH) method is based on a recursive peeling procedure that 
uses a specific survival peeling criterion derived from non/semi-parametric statistics such as the hazards- 
ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the tuning parameter of the model 
and validate it, we introduce an objective function based on survival or prediction-error statistics, such 
as the log-rank test and the concordance error rate. We also describe two alternative cross-validation 
techniques adapted to the joint task of decision-rule making by recursive peeling and survival estima¬ 
tion. Numerical analyses show the importance of replicated cross-validation and the differences between 
criteria and techniques in both low and high-dimensional settings. Although several non-parametric 
survival models exist, none addresses the problem of directly identifying local extrema. We show how 
SBH efficiently estimates extreme survival/risk subgroups unlike other models. This provides an in¬ 
sight into the behavior of commonly used models and suggests alternatives to be adopted in practice. 
Finally, our SBH framework was applied to a clinical dataset. In it, we identified subsets of patients 
characterized by clinical and demographic covariates with a distinct extreme survival outcome, for which 
tailored medical interventions could be made. An R package PRIMsrc is available on CRAN and GitHub. 
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1 Introduction 


Non-Parametric Methods for Bump Hunting 

The search for data structures in the form of bumps, modes, components, clusters or classes are important 
as they often reveal underlying phenomena leading to scientific discoveries. It is a difficult and central 
problem, applicable to virtually all sort of exact and social sciences with practical applications in various 
fields such as finance, marketing, physics, astronomy and biology. 

It is common to treat the task of finding isolated data structures with the help of a response function as 
in a regression or classification problem or simply a probabilistic model as in a density estimati on p roblem. 


Among the non-parametric unsupervised methods, this can be done by testing modality 10|, l35l . l58l . l6ll |. 
using nonparametric mixture models (see e.g. j3] for a review), pattern recognition or clustering. However, 
beside the limitations or problems encountered by these methods in higher dimensional settings, model 
fitting e.g. of finite mixture models is challenged by the estimation of the true number of components 
17l |. A similar situation exists for clustering procedures where the true number of clusters is unknown. 
Moreover, unsupervised methods may also fail to capture true data structures simply by ignoring a response 
if available 17j. Althou gh non-parametric supervised approaches such as, for instance, decision trees and 
their ensemble versions [gja], do not have this drawback, these classification and regression procedures 
may also perform poorly [l7| since they are designed to work when the true number of classes is fixed or 
assumed in advance. 

Exploratory supervised bump hunting procedures are among the few non-parametric methods that 
have been proposed to address this problem. These methods seek bump supports (possibly disjoint) of 
the input space of multi variables where a target function (e.g. a regression or density function) is on 
average larger (or lower) than its average over the entire input space. They cover tasks such as: (i) 
Mode(s) Hunting, (ii) Local/Global Extremum(a) Einding, (iii) Subgroup(s) Identification, (iv) Outlier(s) 
Detection. One known as the Patient Rule Induction Method (PRIM) was initially introduced by Priedman 
Sz Eisher [3l| and later formalized by Polonik [^. Essentially, the method is a recursive peeling algorithm 
that explores the input space to find rectangular solution regions where the response is expected to be 
larger on average. Some interesting features common and distinct to decision trees such as Classification 
and Regression Trees (CART) help describe PRIM. As a rule-induction method like CART, PRIM 
generates simple decision rules describing the solution region of interest. Eurther, like CART, PRIM is a 
non-parametric procedure, algorithmic in nature (backwards fitting recursive algorithm), which makes few 
statistical assumptions about the data. Although PRIM does not explicitly state a model as CART, one 
can be formulated [^ . 73]. Both algorithms/models have the possibility to recover complex interactions 
between input variables. Basic difference between the two methods lies in their approach and goal (reviewed 
in section E^T]) . 

To date, only a few extensions of the original PRIM work have been done: This includes a Bayesian 
model-assisted formulation of PRIM ^ , a boosted version of PRIM based on Adaboost 0 , an extension 
of PRIM to censored responses and to discrete variables [^. Although PRIM is intrinsically 

multivariate, it was uncertain from the original work how the algorithm would perform in ultra high- 


dimension where collinearity [30|, l3l| and sparsity abound. So, recently, an interesting body of work 


studied when and why the Principal Component space can be used effectively to optimize the response- 
predictor relationship in bump hunting. This was first addressed in 171], where the computational details of 


such an approach were laid out for high-dimensional settings. Eurther, focusing on the properties of PRIM, 
authors demonstrated using basic geometrical arguments how the PC rotation of the predictor space alone 
can generate “improved” bump estimates 0 . 20 j. These developments have important implications for 


general supervised learning theory and practical applications. In fact, 17| first used a sparse PC rotation 


for improving bump hunting in the context of high dimensional genomic predictors and later showed how 
this approach can be used to find additional heterogeneity in terms of survival outcomes for colon cancer 
patients ( 0 ). 
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Model Development and Validation in Discovery-Based Research 

The primary problem encountered in discovery-based research has been non-reproducible results. For 
instance, early biomarker discovery studies using modern high-throughput datasets with large number 
of features have often been characterized by false or exaggerated claims and eventually disappointment 
when original results could not be reproduced in an independent study 


21|, 123|, 128|, 133|, 1491, IM, 


631, IM, 


Sadly, these results have been published even in high-profile journals and considered to provide definitive 
conclusions for both clinical care and biology. The problems of model reliability and reproducibility have 
usually been characterized by issues of severe model over-htting, biased model parameter estimates and 
under-estimated errors. This has been attributed to a lack of proper rules to assess the analytical validity 
of studies simply because they were either under-developed or not routinely/correctly applied S, This 
problem first received the attention of statisticians (see for instance reviews on guidelines and checklists 
BS0) as well as editors and US regulators lately [^ . 


Meanwhile, considerable development work has been done in the helds of feature selection, predictive 
model building and model validation to resolve the aforementioned issues. Recent developments include 
strategies such as variable/feature selection, dimension reduction, coefficient shrinkage and regularization. 
The challenge is obviously more acute in the context of high-dimensional data where the number of variables 
greatly exceeds the number of observations (so-called p » n paradigm), since usually only a small number 
of variables truly enter in the model, while the large majority of them just contribute to noise. This noisy 
situation is even more complicated by the multicollinearity and spurious correlation between variables as 
well as the endogeneity between variables and model residual errors (see e.g. 29| for a recent review). 

A common situation where model reliability and reproducibility arise is when, for instance, model 
performance estimates are calculated from the same data that was used for model building, eventually re¬ 
sulting in initially promising results, but often non-reproducible BSQ- These so-called “resubstitution 
estimates” are severely (optimistically) biased. Another problematic situation is when not all the steps of 
model building (such as pre-selection, creation of the prediction rule ar id p arameter tuning) are internal to 
the cross-validation procedure, thereby creating a selection bias [i, 71 1. In addition, findings might not 

be reproducible even when proper independent sample and validation procedures are used. Problems may 
arise simply because cross-validated estimates are well-known to have large variance, a situation that is 
obviously more prevalent when few independent observations or small sample size n are used 22, 124, |. 


Predictive Survival/Risk Modeling by Rule-Induction Methods 

One important application of survival/risk modeling is to identify and segregate samples for predictive 
diagnostic and/or prognosis purposes. Direct applications include the stratihcation of patients by diagnostic 
and/or prognostic groups and/or responsiveness to treatment. Therefore, survival modeling is usually 
performed to predict/classify patients into risk or responder groups (not to predict exact survival time) from 
which one usually derives survival/risk functions estimates (e.g. by Kaplan-Meier estimates). However, 
for the reasons mentioned above, Kaplan-Meier estimates for the risk groups computed on the same set of 
data used to develop the survival model may be very biased [^, l71| . 

In the context of a time-to-event outcome, regression survival trees have proven to be useful. Several 
dev elop ments have been made for fitting decision trees to non-informative censored survival times [l|. 


dev elop 

ni, m 


32l . l45l . |46| . l62l |. Although regression survival trees are powerful techniques to understand for 


instance patient outcome and for forming multiple prognostic groups, often times interest focuses only on 
estimating extreme survival/risk groups. In this respect, survival bump hunting aims not at estimating 
the survival/risk probability function over the entire variable space, but at searching regions where this 
probability is larger (or smaller) than its average over the entire space. 

Also, one possible drawback of decision trees is that the data splits at an exponential rate as the 
space undergo partitioning (typically by binary splits) as opposed to a more patient rate in decision 
boxes (typically by controlled quantile). In this sense, bump hunting by recursive peeling may be a 
more efficient way of learning from the data. With the exception of the work of LeBlanc et al. on 
Adaptive Risk Group Rehnement [i^, it has not been studied whether decision boxes, obtained from box- 
structured recursive peelings, would yield better estimates for constructing prognostic groups than their 
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tree-structured counterparts. 

Although resampling methods are often useful in assessing the prediction accuracy of classifier models, 
they are not directly applicable for predictive survival modeling applications. Simon et al. have reviewed 
the literature of such applications and identihed serious dehciencies in the validation of survival risk models 


[66|. They noted for instance that in order to utilize the cross-validation approach developed for 


classification problems, some studies have dichotomized their survival or disease-free survival data .... 
The problem on how to cross-validate the estimation of survival distributions (e.g. by Kaplan-Meier 
curves) is not obvious 651]. In addition, beside Subramanian and Simon’s initial study on the usefulness of 
resampling methods for assessing survival prediction models in high-dimensional data 67||, no comparative 
study has been done for rule-induction methods and specifically recursive peeling methods such as our 
“Patient Recursive Survival Peeling” method (see section 12.2.71) . 

Contribution and Scope 

Our survival/risk bump hunting model is built upon the regular bump hunting framework, which we 
extended to accommodate a possibly censored time-to-event type of response. To build our survival/risk 
bump hunting model, we first describe our “Patient Recursive Survival Peeling” (PRSP) method, a non- 
parametric recursive peeling procedure, derived from a rule-induction method, namely the Patient Rule 
Induction Method (PRIM), which we have extended to allow for survival/risk response, possibly censored. 
In the process, we describe what appropriate survival estimator and statistic may be used as a peeling 
criteria to fit our survival/risk bump hunting model. 

One of the critiques made in the original PRIM work was the lack of validation procedure and measures 
of significance of solution regions. So, our objective was also to develop a validation procedure for the 
purpose of model tuning by means of an optimization criterion of model parameters tuning and a resampling 
technique amenable to the joint task of decision rule making by recursive peeling (i.e. decision-box) 
and survival estimation. Specifically, we describe here two alternative, possibly repeated, A'-fold cross- 
validation techniques adapted to the task, namely the “Replicated Combined CV” (RCCV) and “Replicated 
Averaged CV” (RACV). Moreover, we show how to use survival end-points/prediction statistics for the 
specific goal of model peeling length optimization by cross-validation. 

Results support the claim that optimal survival bump hunting models may be reached using appropriate 
combination of criterion and technique under certain situations, for which we provide guidelines. Finally, we 
show empirical results from a real dataset application and from simulated data in low- and high-dimension, 
illustrating the efficiency of our cross-validation and peeling strategies and the adequacy of our survival 
bump hunting framework in comparison to other available non-parametric survival models. 

We do not describe nor discus the specific treatment of dimension-reduction or variable selection in 
high-dimensional settings for the only reason that the focus of this study is on cross-validation and peeling 
strategies. Even though the issue of model unreliability is known to be more severe when there is a large 
number of variables to choose from 64], it is known to persist even in low-dimensional setting [^. So, 
we posit that the framework described here is relevant and applicable to both low and high-dimensional 
situations. Nevertheless, the method does include cross-validation procedures to control model size 
covariates) in addition to model complexity (# peeling steps). It has been tested in multiple (> 20) low 
and high-dimensional situations where n ^ p and even n « p (see abstract of application article and 
our example datasets in our R package PRIMsrc 0 ) and we show empirical analyses in high-dimensional 
simulated datasets where n < p. 


2 Survival Bump Hunting for Exploratory Survival Analysis 

2.1 Bump Hunting Model 

2.1.1 Notation - Goal 

The formal setup of bump hunting is as follows [see also l3l|, l59l] . Let us consider a supervised problem with 
a univariate output (response) random variable, denoted y e R. Further, let us consider a p-dimensional 
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random vector X 6 R*’ of support S, also called input space, in an Euclidean space. Let us denote the p 
input variables by X = of joint probability density function p(X) and by /(x) = E(y|X = x) the 

target function to be optimized (e.g. any regression function or e.g. the p.m.f or p.d.f /x(x)). 


Briefly, the goal in bump hunting is to find 
a sub-space or region {R ^ S) of the input 
space within which the average value /r of 
/(x) is expected to be significantly larger (or 
smaller) than its average value fs over the 
entire input space S (Figure [I]). In addition, 
one wishes that the corresponding support 
(mass) of R, say I3r, be not too small, that 
is, greater than a minimal support threshold, 
say 0 < /?o < 1. 

Formally, in the continuous case of X: 

Ir = 

Pr = 

In supervised problems with an output variable (response) y, one would seek to characterize the condi¬ 
tional expectation F(y|X = x) and infer the properties of the unknown joint probability density function 
p(X), whereas in the case of unsupervised learning, one would have to directly infer the properties of p(X), 
e.g. from some density estimate, without the help of a response. 

Let Sj be the support of the jth. variable Xj, such that the input space can be written as the (Cartesian) 
outer product space S = Let Sj ^ Sj denotes the unknown subset of values of variable Xj 

corresponding to the unknown support of the solution region R. Let J ^ {1,... ,p} be the subset of indices 
of selected variables in the process. The goal in bump hunting amounts to finding the value-subsets {sj]jej 
of the corresponding variables {xj}j^j such that 

^ = |n ^ Is){.Pr » /3o)| (3) 


m 



Figure 1: Schematic representation of bump hunting in the unidi¬ 
mensional case (p = 1), where the target function /(x) is a regression 
function of x and the estimated region R is a contiguous interval 
(red segment) corresponding to larger values of /(x) on average. The 
support (5 r of R and the average values /r and fs are shown. 




p(x)dx » /3o 


» fs 


I . 


( 1 ) 

( 2 ) 


2.1.2 Estimates 

Since the underlying distribution is not known, the estimates of /r and f^R must be used. Assume a 
supervised setting, where the outcome response variable is y = and the explanatory/input 

variables are X = (xi...x„)^, where each observation is the p-dimensional vector of covariates Xj = 
\xip ... Xi^p^, for z e {1,..., n}. Plug-in estimates of the average value Jr of the target function /(x) and 
of the support I3r (eq. [2]) of the region R are respectively derived as: 
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2.1.3 Remarks 


1 . 


3. 


4. 


is the usual interval of the form Sj = 
a I J|-dimensional hyper-rectangle in R 


The goal amounts to comparing the conditional expectation of the response over the solution region 
R: fji = E[f{-x)\x e i?] with the unconditional one fs = -E[/(x)]. 

Larger target function average Jr is associated with smaller support /3r of the region R (Figured]). 
So, in practice, there is a trade-off between maximizing /r and maximizing Pr. 

If the target function to be optimized is for instance the p.m.f or p.d.f /x(x), then Pr(x e R) is the 
probability mass/density of a local maximum and the task is equivalent to a mode(s) hunting. 

In the case of real-valued inputs, the entire input space is the p-dimensional outer product space 
S ^ RP; the support Sj of each individual input variable (and of each corresponding value-subset Sj) 

c R for j E J] the solution region R has the shape of 

■^1, called a box, denoted B, which can be written as the outer 
product of |J| intervals of the form B = X 

In general, region R could be any smooth shape (e.g. a convex hull) possibly disjoint. Describing 
or modeling such region would be difficult in high dimension and especially when the number of 
variables is larger than the number of observations {p » n paradigm). In general, there is a trade-off 
between the goodness of fit and the interpretability of the inferences that we want to make. Here, we 
focus on interpretable models based on rectangular boxes in the input space of variables. Typically 
these rectangular boxes are aligned to the coordinate axes, but an immediate extension is to use 
linear combination rules of variables, i.e. a rotated space of input variables, such a the principal 
components space. We have showed that this strategy may provide a more favorable space to learn 
from the data (see for instance ( 13 - 0 ). 


2.1.4 Estimation by the Patient Rule Induction Method (PRIM) 

The Patient Rule Induction Method (PRIM) is used to get the region estimate R with corresponding 
support estimate (^r and conditional output response mean estimate fR. Essentially, the method is one 
of recursive peeling/pasting algorithm (a discrete version of the steepest ascent method) that explores the 
input space solution region, where the response is expected to be larger on average. The method generates 
a sequence of boxes that collectively cover the region estimate R. The way the space is covered and the box 
induction is done as well as how the patience and stopping rules are controlled is detailed in the original 
article of Friedman & Fisher 3l|, later formalized by Polonik & Wang 59|. 


Covering - Coverage Stopping Rule. A sequence of boxes generated from the data {x*, 

to collectively cover the solution region R. Starting from an initial box Bi that covers all the data, the 
box sequence construction algorithm is recursively applied to subsets of the data as follows. At the mth 
iteration (m > 1), a box Bm is induced (by the top-down peeling algorithm - see next) using the data 
remaining after removal of all the observations contained in the previous boxes: {(yi,Xj): Xj ^ Br}. 

At the Mth iteration of the covering loop, the box sequence stops either (i) when the estimated 

individual box support /3 m becomes too small, say less than an arbitrary threshold 0 < /3o < 1 expressed 
as a fraction of the entire data: /3 m < Po, where, or (ii) when the estimated box output mean becomes 
too small, say ^m < V where y = ^ Yji=i Vi i® the global mean, where 


M-l 


1 / 

/3m = - 2 3^ ( Xj e Rm & Xj ^ U ^ 


Vm — 


2 = 1 
1 


m=l 


M-l 


nl^M j=i 


^ yj Xj 6 Bm & Xj ^ y 


m=l 


Box Induction. To induce the box B^a at the mth iteration (m > 1), the top-down peeling algorithm 
generates a subsequence of nested boxes starting from an initial box Bm,i that covers all the 

data remaining at the mth iteration of the covering loop. How L is estimated is the subject of section [321 


6 







At the Ith. iteration, a sub-box is peeled off (see next) from within the current box Bm,i to produce the 
next smaller box The particular sub-box b^i is chosen to yield the largest box output mean value 

ym,i+i within the next box Bm,i+i, such that; 


1 n / i 

ym,l+l ~ ^ ^ & Xj ^ I J B^ f) 

nl3m,i+i V b=i 

Bm,i +1 = Bm,i\bm,i , where 

where C{bm,i) represents the class of potential sub-boxes bm,i eligible for removal at step or generation (m, 1) 
and ‘\‘ represents the set minus operator. The current box Bm,i is then updated: Bm,i+i = Bm,i+i\b^ i 
and the peeling procedure is looped until some stopping rule is met (see next). Because a top-down peeling 
is a greedy search algorithm, it may cause overfitting, so a bottom-up pasting is applied to the minimal 
candidate box to repeatedly expand along any edge until the expansion fails to increase the output response 
average within the box. 

Patience - Induction Stopping Rule. There are two important meta-parameters that control the box 
induction algorithm; (i) the peeling fraction 0 < oq < 1 that controls the degree of patience, and (ii) the 
minimal box support threshold 0 < /3o < 1, expressed as a fraction of the whole data that is used in the 
stopping criterion (see next). Only a quantile oq of the data that is in the box Bm,i is peeled off at the 
Th iteration of the peeling loop as follows. Each eligible sub-box bm,i is defined by a single input variable 
Xj. For real valued variables, there are two eligible sub-boxes bj^i e C{bm,i) and bj^i e C{bm,i), which 
respectively border the lower and upper boundaries of the box Brn,i on the jth input variable xj: 


j 

I*'!;™.. 


|x: Xj < x^ 
{x: Xj > x^ 


(«o)i 
j > 
(l-ao)i 


where and x^-^ are respectively the aoth and (1 — ao)th quantiles of the Xj values. At the 

Lth iteration of the peeling loop, the box sequence stops when the estimated individual support 

ldm,L of the last box Bm,L becomes too small, say (Im,L < Po, where /3o is an arbitrary minimal box support 
threshold: 


/3m ,1 = ^ SLi ^ ^m,i) for L = 1 

Pm,L = ^ ^ ^rn,L & Xj ^ Bm,l) for L > 1 


Note that, with our notation, the last box Bm,L of the subsequence is also the next box of the outer 
box sequence {Bm}m=i- So, Bm,L = Bm+i, and similarly and /3m,L = /3m+i- 

Decision Rules. It is desirable that the solution region R be described in an interpretable form by 
logical statements involving the value-subset of each selected input variable. The above algorithm results 
in simple decision rules of the input space, where each box Bm, m = 1,... ,M, is described by the outer 
product of the value-subsets sj^m of each individual input variable Xj, for j e J. The idea is to describe 
the solution region i? by a disjunctive rule of M conjunctive subrules of the form TZ = lJm=i3^m, where 
IZm = {xeBm} = njej(xj £ Sj,m)- In the case of real-valued input variables, each subrule becomes 
TZm = ^ [^7m’^/m]) the solution region R is fully described by the disjunctive rule: 

M M f 

^ ^ U ^ U I ^ [^j,mT^m]) 

m=l m=l vjgJ 
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2.2 Survival Bump Hunting by Recnrsive Peeling 

Assume a supervised problem, where the function of interest is a univariate survival/risk response variable 
(possibly censored) in a multivariate setting of real-valued (continuous or discrete) input variables/covariates 
X = The goal is to characterize an extreme-survival-response support in the predictor space and 

identify the corresponding box-defined group of samples using a recursive peeling method derived from the 
Patient Rule Induction Method (PRIM). 


2.2.1 Survival-Specific Peeling Rule 

As mentioned in the introduction, rule-induction methods such as decision tree-based methods have proven 
to be useful to estimate relative risk in groups in the context of a time-to-event outcome. Several methods 
have been proposed for htting trees to non-informative censored survival times QJ [nl, El, Ei, m, Slei]. 

Basic differences between decision-tree and decision-box methods lie in their approach and goal. In¬ 
stead of recursively partitioning the space using specific partitioning and stopping criteria, one proceeds by 
recursively peeling the space to produce box-shaped regions designed to approximate the solution region R, 
using specific peeling and stopping criteria (see details in l2.1.4p . In decision-trees, a recursive partitioning 
method attempts to model the target function over the entire data space by generating partitions in which 
the response averages will be as different as possible, while in decision-boxes, a recursive peeling method 
generates a box-shaped region in which the response average will be as extreme as possible. So, in contrast 
to survival decision-trees models, survival bump hunting is not aimed at estimating the survival/risk prob¬ 
ability function over the entire covariate space, but at finding regions where this probability is larger than 
its average over the entire space. Numerical analysis below ([4]) show comparisons of relative risk estimates 
obtained from decision-boxes versus those obtained from decision-trees. Other interesting differences lie in 
the weaknesses and strengths of the outputs and their applications, which we left for discussion ([6]). 

In this section, we describe the use of several candidate survival-specihc peeling criteria and discuss 
their merits or strengths. Most of these criteria are borrowed from the survival splitting rules used to 
grow regression survival trees |il,|4i, 0,[63,|63] or from their ensemble versions 4^. Here, survival-specific 
peeling criteria are to be used to decide which covariate will be selected to give the best peel between two 
boxes from two consecutive generations (parent-child descendance) of the box induction/peeling loop in a 
recursive peeling algorithm (see next section r2. 2. 7p . 

To account for censoring, we simply supervise by proxy for extreme time-to-event outcome, turning the 
censored outcome y into an uncensored “surrogate” outcome z. Using previous notation Isection [2.1.411 . 
a peeling at step {m,l) of the box induction/peeling sequence produces a partition of the survival data 
from the parent box into two partitions, for a given set of covariates: the child box Bjn,i and its 

complement. The focus is on selecting a sub-box bm,i at step (m, 1) of the box induction/peeling sequence 
that is to be peeled off from the parent box along one of its faces (i.e. direction of peeling := axis 

of dimension j) to induce the next child box Bjn,i and its complement. This is done by maximizing the 
“surrogate” outcome rate of increase between two consecutive generations of boxes Bm,i-i and Bm,i of 
the box induction/peeling sequence. Denote by z{m,l) the box “surrogate” outcome at step or generation 
(m, Z) of the box induction/peeling sequence (Algorithm [I]). The rate of increase in z{m,l) at step or 
generation (m, 1) between two consecutive generations of boxes B^^i-i and Bm,i is defined as: 


r(m, 1) = 


z{m, 1) — z{m, I — 1) 
/3m, Z 


( 6 ) 


Finally, the particular sub-box 6^ ^ that is chosen to yield the largest box increase rate r{m, 1) between 


box Bm 1-1 and the next one Bm i is such that 


Bm,i = Byn,i-i\Kn,i > where 
Ka,l= argmax [r(m,Z)], 


(7) 


where C{bm,i) represents the class of potential sub-boxes bm,i eligible for removal at step or generation 
(m,Z). 

























2.2.2 Survival Notation and Definitions 

Let’s denote the two child boxes described above by where, by convention, let’s decide that 

subscript g = 1 stands for the “in-box” Bm,i and g = 2 for its complement or “out-of-box”. Dropping 
further step subscripts (m, 1) for simplicity, assume that there are n individuals in parent box and 

Ug in a given child box Bg^rn,i for fixed g e { 1 , 2 } such that n = ^ 3 - Also, we let 'ji^g) = / (xj e Bg^jn,i) 

be the indicator function of individual subject i within a given child box at step (m, 1) for fixed 

ge{l,2}. 

The response variable being subject to censoring, we use the general random censoring model. We 
focus on a univariate right-censored survival outcome under the assumptions of independent observations, 
non-competitive risks and random (type-I or -II) non-informative censoring. Denote the true survival time 
(or lifetime/failure time) by the random variable T and the observed censoring time by the random variable 
C, then the observed survival time is the random variable Y = min(r, C). Also, under our assumptions, C 
is assumed to be independent of T conditionally on covariates X. Let the observed event indicator random 
variable he A = I{T ^C). 

For each observation i 6 {1,..., n} in parent box Bm,i-i, the true survival time, observed censoring time, 
observed survival time and observed indicator event are the realizations denoted by Tj, Ci, Yi = min(rj, Ci) 
and 5i = I{Ti ^ Ci), respectively. Also, denote by < t{ 2 ) < ■ ■ ■ < for N ^ re the distinct ordered 
event times of death (not counting censoring times) in parent box Bm,i-i- Note that intervals between 
events for h e {1,... , N} are not necessarily uniform. Finally, the observed data in parent box Bm,i-i 
consists of {Yi, dj, Xj), where x* = for i e {1,..., re}. 

Let 5i^g = 'yi{g)I{Ti ^ Ci) be the observed indicator event of time point Ti for each individual i e 
Ug] in a given child box foi' g ^ {Ij 2}. Also, let dh^g and n^^g be respectively the number 

of events (deaths) and individuals at risk at time for h e {l,...,Ng} in a given child box Bg^m,i 
for g e {1,2}, such that N = Xlg=iAg. For simplicity, let’s use the same subscript h e {!,..., A^} 
and i e {!,...,re^} from parent and child boxes for indexing events and individuals, respectively. Note 
that Uh^g is the number of individuals in child box Bg^rn,i who either have not yet had an event (or 
been right-censored) just until time or who had an event at time Formally, if considering all 

individuals in child box Bg^rn,i only, nh,g = ^ ^(Zi)) or, if considering all individuals in parent 

box Sp,m,z-i, Uh^g = Yli^iliig)! {Yi>ti^h))- Likewise, one can write dh,g = 2"=! ^ ^(Zi)) > or 

dh,g = Xir=i ^mig)! {Yi ^ t(zi)) • Also, denote dh = dh,g and re^ = Xip=i nh,g. 

Let S{t) = Pr(r ^ t) be the survival probability that an individual from the population of interest will 
have a lifetime T free of the event until time t. As usual, denote by A{t) = — log(5(t)) the corresponding 
cumulative hazard function and by X{t) = the hazard rate function. To come up with decision-box 
survival-specific peeling criteria (see next section [ 2 . 2 . 3p . the following non/semi-parametric estimators can 
be used with respect to the box-defined subgroups: the Nelson-Aalen estimator, denoted by Hg,m,i{t), 
to estimate the cumulative hazard function; and the hazard rate function estimator derived from a Cox 
Proportional Hazards (CPH) regression model. By definition, these estimators are given as follows for 
individuals in a given child box Bg^rn,h for fixed g e { 1 , 2 }, at step (m, /): 


Hg^m,i{t) — 


^h,g 




As usual, the hazard rate function may be estimated by regressing the subject-specific hazard rate on 
the covariates in a CPH regression model, assuming proportional hazards 0 . With the above notation, 

Aj,g,m,z(t|xj) = Ao(t)exp[?7g,m,z(xi)] 

= Ao(t) exp [gg,m,ni{g)] 

where the regression function gg^m,l{^i) = ^ Ylj=iVj,g,m,lXi^j with p-dimensional vectors of regres¬ 
sion coefficients 'ng,m,i = [z?i,g,m,z • • • gp,g,m,iY' covariate Xj = [xi^i... reduce respectively to a 

scalar Tjg ^^ i = gi^g^rn,i = Vg,m,i times a simple box indicator variable Xj = Xi^i = I {xi e Bg^^^i) = 7j(g). 
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2.2.3 Non-Parametric Survival Peeling Criteria 

The choice of uncensored surrogate outcome z{m,l) in equation [6l that is, which estimator to choose as a 
box peeling criterion at a peeling step (m, 1), is central to the PRSP algorithm (see Algorithm [1]). Currently, 
our Survival Bump Hunting implementation in our R package PRIMsrc offers three statistics derived 
from the above non/semi-parametric estimators: (i) the Log-Rank Test statistic, (ii) the Nelson-Aalen 
Summary statistic and (iii) the CPH-derived Log Hazard Ratio statistic (assuming proportional hazards). 
• The (two-sample) log-rank test can be used at a peeling step (m, 1) to compare estimates of the hazard 
functions from each child box-defined subgroups (“in-box” and its “out-of-box” complement). We 


recently proposed to use it as a survival-specific box peeling criterion [1^. Using the (two-sample) 


log-rank test statistic is actually a natural candidate for survival decision-box, having been a well- 
established concept for splitting trees in survival decision-trees [l, 45, 4^, 69] and for being robust 

in non-proportional hazard settings [i^. The approximate lo g-ra nk test introduced by LeBlanc and 


Crowley can be used instead to greatly reduce computations 46|. Formally, one can derive the Log- 
Rank Test (LRT) statistic, denoted XLRT{m,l), for the individuals in a given child box for 

fixed g e {1,2} (e.g. g = 1), at step {m,l) as follows: 


XLRTim,l) = 


s;pN 

2-ih=i 




VsL.«m^(i-^)( 


rih-dh 

rih-^ 


( 8 ) 


If the Nelson-Aalen estimator is used, one can derive an overall summary statistic across all observed 
time points Yi for i e {1 ,..., Ug] of the individuals in a given child box Bg^rn.u for fixed g e (1, 2} 
(e.g. g = 1), at step This is done by adding the Nelson-Aalen estimators over all these time 

points to obtain a so-called Cumulative Hazard Summary (CHS), denoted Acnsifn-,^)'- 


ni 

AcHsi’mJ) = 2 -^l,m,/(bi) 

ni / 


S E 


i=l 


dh,i 

Rhi 


h:th^Yi 

sA ( Sp di,ld iXi ^ t{h)) 

ni 


E 

n\ 

E-'-. 


(9) 


2=1 


Alternatively, the use of a Hazard Ratio or Relative Risk was originally proposed by LeBlanc et al 
471, 148(1 . If the Cox-PH hazard rate estimate is used, then one derives the Log-Hazard Ratio (LHR) 
statistic, denoted Xlhr{jrA)^ for t^ie individuals in both child boxes Bg m,h for 9 £ {1)2}, at step 


>^LHRim,l) = log 


= log 


Ao(t)exp [ni,m,niA)] 
Ao(t)exp[r72,m,z7i(2)] 
exp{gi^rn,i) 
exp(O) 




( 10 ) 


where 7i(l) = 1 and 7 i( 2 ) = 0 by convention. 
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Finally, all the above three peeling criteria statistics can be used to maximize the differences in survival 
outcomes between two consecutive boxes and Bm,i of the box induction/peeling sequence. This 

leads to the derivation of the corresponding box rate of increase estimate r(m, /), at step (m, 1), according 
to equation m 


rLRT{m,l) 

fcHsim,l) 

rLHR{m,l) 


XLRTjm, 1) - XLRTjm, I - 1) 
Pm,I 

1) - AcHsjm, I - 1) 

Pm,I—I Pm,I 

^LHRjm, 0 — ^LHRjm, I — 1 ) 

Pm,l—1 Pm,I 


2-ii=l 2-ii=l 1 

Pm,1—1 Pm,I 

Vl,m,l 'ni,m,l—l 

Pm,1—1 Pm,I 


( 11 ) 

( 12 ) 

(13) 


2.2.4 Comments 

An alternative estimator is to consider the conditional probability Pg^rn,i{t\^i) = Pi’(Tj ^ t|xj e Bg^rn,i), 
which amounts to computing the Nelson-Aalen estimator Hg^rn,i{i) conditioning on the data in a given 
child box Although this probability is interpretable and estimable, it is by definition a function of 

an observed event (death) at time t (in a given child box Bg^m,i)- So, one would need to fix a meaningful 
survival time t. One could think, for instance, of the box median survival time (as is commonly done) 
or the box maximal event time. In addition, this would induce a likely loss of “power” in contrast to an 
estimator based on the global survival distribution. As a result, for a given choice of t, this probability 
may be easy to estimate in some boxes but not estimable in other boxes after sufficient peeling. 

Note that the Nelson-Aalen estimator is known to imply conservation of events 5^, that is in this 
case, the total number of deaths is conserved in each child box In fact, that is what the Cumulative 

Hazard Summary (CHS) statistic amounts to (see eq: [9]). 

A modified summary statistic derived from the Nelson-Aalen is also possible by normalizing Achs{'^^ 0 
to the total box sample size Ug. This would have the advantage of hedging against large versus small box 
bias. In our experience, this could be important in the case of discrete covariates. 

In addition to the above assumption on the censoring mechanism, the CPH-derived Log-Hazard Ratio 
or Relative Risk statistic to be used in equation [6] assumes proportional hazards, which may not be realistic. 
For this reason, this survival peeling criterion is referred to the reader as not preferred and left as a means 
of comparison to potentially better alternative survival peeling criteria described above (section I2.2.3p . 

2.2.5 Alternative Survival Peeling Criteria 

Further discussion of the use of the above estimators is found in our comparisons of numerical results 
fsection 14.3|) and in the discussion ([6]). Additionally, we mention below a few more alternative survival 
peeling criteria, although none of these is preferred nor implemented in our R package. 

I. It is common to estimate the hazard rate of the simplest parametric survival model (exponential 
survival model) by the parametric Maximum Likelihood Estimator (MLE). Using above notation 
and dropping further step subscripts (m,/) for simplicity, if we let Tj ~ Exp(A) for f e {1,... ,ng] 

The use of this simple parametric estimator of 
Since it is always estimable, 


then the parametric MLE is: \g{t) = ■ 

I-1 I- 

hazard rate for a box was originally proposed by LeBlanc et al. 47|, [45 
it could be used to maximize the box rate of increase r(m, 1) (eq. E]) at each step (m, 1). It also does 
come with likely the least variance. However, the underlying assumption of constant hazard rate over 
the duration of time makes it potentially unrealistic and therefore not preferred. 


The Log-Rank Score Test statistic for splitting in trees (see 37[) or their ensemble versions 40| is 
another potential criterion available. Note that if there are no tied event times, the Log-Rank Test 
and the Log-Rank Score Test statistics are identical and, unless there are a large number of tied 
times, will give very similar results. 


II 















Although some may argue that residuals are counter intuitive to use as a peeling criterion, others 
have used them. For instance, the use of Martingale residuals is strongly recommended by Kehl et al. 
(^ . Their claim is that they perform better than the deviance residuals, which are a transformation of 
the Martingale residuals correcting for long tails of the residual distribution. However, others have found 
that the deviance residuals lead to better rule induction results for bump hunting (Steve Horvath et al.’s 
personal communication and 0). Therneau et al have also found that using the deviance residuals 
as a splitting criterion in regression trees leads to better results than the Martingale residuals. 

1. Martingale Residuals Mi = 6i — yl(yi,Xj) = Si — ylo(^i) exp(T 7 '^Xj), for the iih observation, result from 
fitting an intercept-only Cox regression to the censored survival times. The idea is to use these as 
new (uncensored) outcomes in the model instead of time [^, where Si is the event indicator and 
ylo(hi) is a non-parametric estimate of the baseline cumulative hazard function for the entire sample. 


2 . 


for the zth observation, have a less 


Deviance Residuals Di = sign{Mi)^2 |^(^ilog tVij ) ~ 

symmetric distribution than Martingale residuals [69j]. LeBlanc and Crowley |45l | also demonstrated 
that (i) using deviance residuals in regression trees is similar to the survival tree methods presented 
by Segal 6^ and Ciampi et al. 11[, and that (ii) using deviance residuals is more efficient than using 


Martingale residuals with regression trees. 


2.2.6 Box End-Point Statistics 

Below is a summary of box end-point statistics of interest one can derive in our Survival Bump Hunting 
method. Each is defined for each step {m,l) and all are implemented in our R package PRIMsrc [0 : 


1. Log Hazards Ratios {LHR), denoted X{m,l) between the highest-risk group/box and lower-risk 
groups/boxes of the same generation. 

2. Log-Rank Test statistic (LRT), denoted x(m,/) between the highest-risk group/box and lower-risk 
groups/boxes of the same generation. 

3. Concordance Error Rate (CER), denoted 9{m,l) in the highest-risk group/box, that is a prediction 

performance metric taking censoring into account. For each step 9{m,l) = 1 — C{m,l), where 

C is Harrel’s Concordance Index for censored data 3^, a rank correlation U-statistic, to estimate 
the probability of concordance between predicted and observed survival times. 

4. Event-Free Probability (EFP), denoted PQ{m,l) or probability of non-event until a certain time 
T{m,l) in the highest-risk group/box (Figure [2] left). For instance, the Probability of Event-Free 
Survival (PEFS) or the Survival Rate (SR) are frequently used. However, Po{m,l) may not always 
be reached for a specified time T{m,l). In this case, we determine the limit end-point PQ{m,l) or 
Minimal Event-Free Probability {MEFP) and corresponding maximal time T'{m, 1), which are always 
observable (see Figure [2] left). 

5. Event-Free Time (EFT), denoted To(m,l) or time to reach a certain end-point probability P{m,l) 
in the highest-risk group/box (Figure [2] right). For instance, the Median Survival (MS) is frequently 
used to indicate the period of time where 50% of subjects have reached survival. However, TQ{m,l) 
may not always be reached for a certain probability P{m,l). In this case, we determine the limit 
end-point TQ{m,l) or Maximal Event-Free Time (MEET) and corresponding minimal probability 
P'{m,l), which are always observable (see Figure [2] right). 

6. Box characteristics: 


2p box edges 


f. {m,l),tUrn,l) 


pi 

box support (mass) j3{m,l) 


i=i 


• box membership indicator 7 (m, 1) 

7. Traces of Covariate Usage VU{m,l) and Covariate Importance VI{m,l) 

8. Kaplan-Meir curves of survival probability values with log-rank test permutation p-values p{m, 1) 
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Figure 2: Survival end points statistics used in Survival Bump Hunting at each step (m, 1) of the box generation. Left: Event- 
Free Probability Po and Minimal Event-Free Probability ('MEFPj Pq. Right: Event-Free Time To and Maximal Event-Free 
Time fMEFTj Tq. Subscripts {m,l) are dropped for simplification but understood. 

2 . 2.7 Estimation by Patient Recursive Survival Peeling 

The strategy employed here is a recursive peeling algorithm for survival bump hunting. Our “Patient 
Recursive Survival Peeling” method proceeds similarly ato which it is done in PRIM except for the box 
induction peeling/pasting criteria and the induction stopping rule (see section [2.1.4p: 


Algorithm 1 Patient Recursive Survival Peeling (annotated below w.l.o.g for a maximization problem). 

• Start with the training data T(i) and a maximal box Bi containing it 

• For m 6 {1,..., M}: 

1: Generate a box Bm using the remaining training data jC(rn) 

2: For / e {1,... , L}: 

— Top-down peeling: Generate a box 13^,1 by conducting a stepwise covariate selection/usage: 
shrink the box by compressing one face (peeling), so as to peel off a quantile oq of observations 
of a covariate Xj for j e {1,..., p}. Choose the direction of peeling j that yields the largest box 
increase rate r(m, 1) of the statistic used as peeling criterion between box B^^i-i and in 
the next generation: Log-Rank Test XLRT{jn,l), Cumulative Hazard Summary AcHsi'iT,l), 
Log Hazards Ratio XpHnitn, 1). The current box Bm,i-i is then updated: Brn,i = 

where 6^ ^ = argmax [r(m, 1)] 
ihm.l ) 

— Bottom-up pasting: Expand the box along any face (pasting) as long as the resulting box 
increase rate r(m, 1) > 0 

— Stop the peeling loop until a minimal box support $rn,L of Rm,L is such that it reached a 

minimal box support 0 ^ /3o ^ 1, expressed as a fraction of the data: (3m l ^ /3o 
- l^l + l 

3: Step #2 give a sequence of nested boxes where L is the estimated number of 

peeling/pasting steps with different numbers of observations in each box. Call the next box 
Rm+i = Bm,L- Remove the data in box Bm from the training data: T(m+i) = 

4: Stop the covering loop when running out of data or when a minimal number of observations 

remains within the last box Byj, say 13m ^ /3o 
5: m <— m -I- 1 

• Steps 7)^1 - #5 produce a sequence of (not necessarily nested) boxes {Bm}m=i: where M is the 
estimated total number of boxes covering 

• Collect the decision rules of all boxes {Bm}m=i ™to a simple final decision rule TZ of the solution 

region R of the form: TZ = |Jm=i '^rn, where TZm = Cljeji^j ^ giving a full description of 

the estimated bumps in the entire input space 
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Cross-Validation for Recursive Peeling Methods and a Survival/Risk 
Outcome 


3.1 Split-Sample-Validation 

3.1.1 Setup 

We previously tested the possibility of finding survival bumps in a small dataset, namely the Veteran’s 
Administration lung cancer trial data from Kalbfleisch and Prentice [i^. We could unravel interesting 
subgroups of patients with a poor survival time that could be characterized by a set of descriptive rules 
on the predictors including treatment intervention. Typically, this was indicative that an alternative 
intervention therapy could be required for these non-responders. While this approach showed promising 
results, it remained naive in that possible issues of bias and overfitting were not kept in check by model 
validation. 

Assessment of model performance (e.g. prediction accuracy) requires the use of separation of the whole 
data C between a “training set” used to build a model and an independent “testing set” T* used 
to assess model performance. To do so, the Split-Sample Validation technique (a.k.a Full Validation) is 
possible. Using this approach, a model is entirely developed on the training set Then, samples in 

the independent testing set C} are used to determine the error rates. The samples in the testing set are 
never to be used for any aspect of model development such as variable selection and calibration and can 
therefore be used to check model performance [51 


m 


Here, cross-validation of box estimates should include all steps of the box generation sequence 
i.e. for the (outer) coverage loop of our “Patient Recursive Survival Peeling” method (Algorithm [T]) , each 
step of which involves a peeling sequence of the (inner) box peeling/induction loop. However, 

for simplicity, cross-validation designs of box estimates and resulting decision rule TZm are shown below for 
fixed m e {1,..., M} of the complete box sequence subscript m is further dropped. 

Without loss of generality, fix m = 1 (first coverage box). 

3.1.2 Estimated Box Quantities of Interest 

Using previous notation, if we let Bi be the Zth trained box and fii be its estimated box support for 
I e {1,...,L} of a box peeling sequence then the test-set mean estimate of a box quantity of 

interest q for the Ith peeling step is indexed by the Zth test box support fij as follows: 


QiPl) = 


1 






X 6 B, 


(14) 


where g(-) is the functional corresponding to the quantity q, b ^x* 6 Bf^ and gl,x(,n* are 

test-set quantities. Useful test-set quantities for the highest-risk box are box end-point statistics mentioned 
in section 12.2.61 


3.2 A'-fold Cross-Validation 

3.2.1 Resampling Design - Notation 

Although using a fully independent test set for evaluating a predictive bump hunting model is always 
advisable, the sample size n in discovery-based studies is often too small to effectively split the data into 
training and testing sets and provide accurate estimates [^, | 
such as AT-fold Cross-Validation (CV) are required jl, M|. 


64l |. In such cases, resampling techniques 


In resampling based on full AT-fold cross-validation, the whole data £ is randomly partitioned into 
K approximately equal parts of test samples or test-sets (£i,... ,£k, ■ ■ ■ ,£k)- For each test-set Ck, for 
k e A'}, a training set is formed from the union of the remaining K — 1 subsets: = 

£ \ The process is repeated K times, so that K test-sets £k are formed of about equal size and K 
corresponding training subsets £(k), for k e {1,Kj. Typically, K e {3,, 10}. The training samples 
are approximately of size n(K — 1)/K and the test samples are of size n* n/K. 
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3.2.2 Cross-Validation Techniques 

Recently, we described a cross-validation technique for recursive peeling methods in a survival/risk setting 
14l |. The subject of this section is to give a more in-depth development of this strategy and compare it to 
standard cross-validation techniques. 

There are issues when dealing with RT-fold CV: first, how to cross-validate a simple peeling trajectory 
{Bi}fL^ and related statistics is not straightforward; second, how to cross-validate survival curve estimates 
and related statistics is also not intuitive (see also [65|); third, the data splitting in the cross-validation 
step should balance the class distributions of the outcome (i.e. here the censoring rate) within the cross- 
validation splits, which we call “stratified random splitting by conservation of events”. So, regular ii'-fold 
cross-validation is not directly applicable to the joint task of box decision rules making by recursive peeling 
and survival estimation. One must design a specific cross-validation technique(s) of survival bump hunting 
that is amenable to this joint task. 

Hence, we propose two techniques by which iT-fold cross-validation estimates can be computed: 

• Averaging Technique: Estimations are first computed for each “in-box” test subset samples, then 
averaged over the cross-validation loops of random splitting to give the “Averaged Cross-Validation” 
estimates (see details in section below [3]3]). 

• Combining Teehnique: All “in-box” test subset samples are first collected from all the cross-validation 
loops of random splitting to build a combined test “in-box” and corresponding combined test “in¬ 
box” samples to compute once the final “Combined Cross-Validation” estimates (see details in section 
below [ 


Note that, unlike in the averaging technique, cross-validated combined estimates are computed on test 
samples of size n instead of n* « n/K, which could be an advantage in the case of tiny sample size n. 
In our numerical analyses, both strategies were compared with each other and with the situation of no 
cross-validation (see result section 14.3p . 

Finally, to account for the high variability of cross-validated estimates 0J 0, IHH , we iterate each cross- 
validation procedure several times over some replicates B (typically, B ^ 10) to average the estimates 
and reduce their variance. This so-called “Replicated Cross-Validation” approach is further detailed in 
section below ()3.6p . Also, aside the Split-Sample-Validation, mentioned in the previous section (|3.ip . other 
resampling techniques are available, which we left for discussion (section!^. 

3.2.3 Model Peeling Length Optimization Criterion 

In model tuning, a trade-off between under-fitting and over-fitting can be achieved by optimizing an 
empirical function or objective criterion that takes censoring into account using cross-validation. The 
“optimization criterion” that we derive below is adapted to the task of of fitting a survival bump hunting 
model by recursive peeling with a survival outcome. Specifically, we tune a peeling model by optimizing 
its complexity, that is, the final length or number of peeling steps L of the peeling sequence. The reason 
is that, for a given set of variables/covariates, the final length L of a peeling model only depends on the 
peeling meta-parameters ao (assumed fixed here) and /3o (see section [2.1.411 . In fact, an upper bound on 

log(/3o) 


the length of all possible peeling trajectories is given by Lag = I logfi-ao) I (s®® M for details). So, a 
cheaper cross-validation can be achieved on L only rather than on ao and /3o simultaneously. 

Assuming m fixed (see step of Algorithm [T]) and dropping subscript m for simplification, the process 
of model building is repeated K times for A; e {1,..., AT} as follows. First, let l{k) denote the Ith peeling 
step in the A:th trajectory for Z 6 {1,..., L{k)} and A: 6 {1,... ,K}, where L{k) denotes the final length of 
a trained peeling model. Note that L{k) ^ for all k e {!,..., Ai}, but, in general this inequality 

is strict for large enough sample sizes. Let be the trained box of support of the box peeling 

sequence that is constructed from training set leaving out the test-set during all aspects 

of model building including covariate selection. 

Second, once a resulting trained decision rule, abbreviated and box definition estimates are gener¬ 
ated from each training set T(fc), cross-validated estimates of box end-points statistics (described in l2.2.6p 
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are computed using the left-out test-set Ck- Three of these are the Log Hazard Ratio {LHR), Log-Rank 
Test (LRT) and the cross-validated estimate of prediction performance, namely the Concordance Error 
Rate {CER) that is obtained by calculating the test-set error rate using the left-out test-set Ck- 

In the subsequent sections, we denote by superscript cv any cross-validated estimate on the test-set 
Ck- Since peeling lengths L{k) are not necessarily equal for all k e iL}, we use the following 

cross-validated maximum peeling length over the K trajectories: 

LZ =. [L{k)\ (15) 

After K rounds of training and testing are complete and (averaged or combined) test prohles of LHR, 
LRT or CER estimates are determined for each step / e {1,..., L^}, model tuning is done by determining 
the optimal peeling length of the peeling sequence. To that end, one uses the maximization of the 
(averaged or combined) test prohles of LHR and LRT or the minimization of the (averaged or combined) 
test prohle of CER as criterion. Formally: 


T CD 

L = argmax 

A™(0 

or L^'" = argmax [x'^^(0] or = argmin 

d“(0 






where A™(/) is the cross-validated LHR) in the high-risk box at step I, is the cross-validated LRT 

between the high vs. low-risk box at step I and is the cross-validated CER between high-risk box 

predicted and observed survival times at step 1. 

Depending on the desired degree of conservativeness, the usual one-standard-error rule [ 3 ^ may be 
applied in combination with the prohles minimizer or maximizer to get smaller estimates corresponding to 
one standard-error below the maximum of LHR and LRT or standard-error above the minimum of CER. 
In the subsequent sections, we denote by superscript cv any cross-validated estimate on the test-set Ck- 

3.3 iL-fold Averaged Cross-Validation 

In A-fold Averaged Cross-Validation, the averaged cross-validated estimate of a box quantity q at the 
l{k)th. step of the box peeling sequence is based on the test samples falling within the trained box Bn^k)- 
The averaged cross-validated estimate of q at step I is simply computed by averaging the estimates obtained 
from all test boxes computed over all K cross-validation loops. Specihcally, each test-set Ck is used to 
estimate the l{k)th. test box membership indicator 
7 ;^^) from the model grown on the training set /l(fc). 

The corresponding test box support is directly 
derived from by computing the fraction of test 
data falling within the trained box The l{k)th 

estimate of the box quantity q is indexed by the corre¬ 
sponding test box support For each training set 

C(^k): a trajectory curve q{x) of a box quantity q is de- 
hned as a piecewise constant curve, evaluated at the 
/(fe)th test box support so that each trajectory 

curve is: g(x) =q0\^k)) ^ ^ (Fig- 

ure[3]), where q{ldyk)) derived as in equation [TTl The 

averaged CV trajectory curve (f''“{x) of length L™ is 
simply the average of the K trajectory curves over the 
K cross-validation loops: (f''"{x) = 

for P\k)+i ^ ® ^ M{ky 




0.0 0.2 0.4 0.6 0.8 1.0 

Box mass 

Pl(k)+\ Pl(k) 

Figure 3: Example of decreasing trajectory curve q(x) of 
a box quantity q. Notice the piecewise constant curve of 
q(x) for ^ 2 ; ^ Pi(k)- Ty convention, we define 

do = di(fc) = do a.nd di(fe)+i = 0. 
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Formally, we show below how things are computed from an initial set of K trained peeling trajectories: 


For the /cth 

First peeling 


Last peeling 

training set 

step in the /cth 

direction of peeling 

step in the /cth 

B(k)- 

trajectory 


trajectory 


/cth training 
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From the K test trajectories, one derives first the “Averaged CV” optimal peeling length of the peeling 
trajectory, according to the optimization criterion for model selection as in equation 1161 


T CD 

L = argmax 

A™(0 

or = argmax or L'^'^ = argmin 







Then, one derives “Averaged CV” estimates for each step / 6 {1,... as follows: 

• The “Averaged CV” box definition, which can be written as the outer product of | J| intervals as in 
equation [3l is formed by taking the rectangular box where each of the 2| J| edge is averaged over the 
K cross-validation loops: 




X [tj ii t^i] where for each j e J, 
jeJ 


t-, = ave 

ke{l,...,K} 

tX = ave 

ke{l,...,K} 




• The “Averaged CV” box membership indicator is formed by counting the data within the “Averaged 
CV” box: 




/[x, € B“(()] 


n 

2 = 1 


The “Averaged CV” box support is computed as the fraction of data within the “Averaged CV” box: 




1 

n 'H 
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• The “Averaged CV” box end-point quantity q is taken as the averaged CV trajectory curve evaluated 
at the test box support 

«“(') = K S « (AU)) • 

k=l 

9 (^i(k)) = ~Jir~ Ij ^ ™ equation m 

The latter is done for the “Averaged CV” box estimates of: (i) The Log Hazard Ratio (LHR) in 
the high-risk box: (A*(A:))’ Log-Rank Test (LRT) between the high vs. 

low-risk box: x™(0 = X Concordance Error Rate {CER) between high-risk 

box predicted and observed survival times: (A%))’ Minimal Event-Free 

Probability {MEFP): ’P^{1) = I]f=i (4%)); (v) The Minimal Event-Free Time {MEET): 

3.4 /L-fold Combined Cross-Validation 

In iL-fold Combined Cross-Validation, for each loop, samples from the training set are used to train a 
peeling model of a certain length, then samples from the test-set are used to determine the “in-box” test- 
set samples falling into the trained box. Eventually, all “in-box” test samples are combined together and 
all “out-of-box” test samples are combined together as well. So, in AT-fold Combined CV, estimate of a box 
quantity q is computed once on the collective test-set “in-box” samples, formed over the K cross-validation 
loops. This allows the estimation of box quantities and box survival distribution curves for both “in-box” 
and “out-of-box” samples. 

Specifically, each test-set is used to estimate the test box membership indicator 7 *^^^ from the 
model grown on the kth training set The Rh combined cross-validated test box membership indicator 
is formed once by taking the vector concatenation of all the cross-validated test box membership 
indicators { 7 *^^^}^^ over the K cross-validation loops. The corresponding /th combined cross-validated 

test box support /3^^{l) is then directly derived from 

The combined cross-validated estimate of a box quantity q at the /th step of the peeling trajectory is 
then computed once from the combined cross-validated test box membership indicator and indexed 

by the corresponding test box support f3{iy^. Here, the combined cross-validated trajectory curve qk{x) is 
defined as the piecewise constant curve of length L^, evaluated at the /th combined cross-validated test 
box membership indicator 7 ™(/). 

Formally, we show below how things are computed from an initial set of K trained peeling trajectories 
(where || denotes the concatenation operator). 
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From the K test trajectories, one derives first the “Combined CV” optimal peeling length of the peeling 
trajectory, according to the optimization criterion as in equation 1161 


T Cl) 

L = argmax 

A™(0 

or L™ = argmax or L™ = argmin 







Likewise, from the K test trajectories, one derives “Combined CV” estimates for each step I e 
{1,..., L™} as follows: 

• The “Combined CV” box membership indicator (Boolean n-vector) is formed by vector-concatenation 
of all the test box membership indicators over the K cross-validation loops: 

7“ 70 - [7”(0]?.i - II 7m- II 

fc=l k=l 


/[x( 6 


• The “Combined CV” box definition, which can be written as the outer product of | J| intervals as in 
equation [3l is formed by taking the rectangular box (2| J| edges) circumscribing all the “in-box” test 
samples over the K cross-validation loops: 




X [tj ii where for each j e J, 
jsJ 


tj,i = .Ki’ * e {1 ,..., X}: x\j e Buj,)] 
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• The “Combined CV” box support is computed as the fraction of data within the “Combined CV” 
box: 

n 

1=1 

• The “Combined CV” box end-point quantity q is taken as the result of the functional q{-) evaluated 
at the Zth “Combined CV” test box support 


ni) 


r (0 


The latter is done for the “Combined CV” box estimates of: (i) The Log Hazard Ratio (LHR) in 
the high-risk box: = X j (h) The Log-Rank Test (LRT) between the high vs. low-risk 

box: = X (/)j; (ii) The Concordance Error Rate (CER) between high-risk box predicted 

and observed survival times: = 0 ; (iv) The Minimal Event-Free Probability (MEFP): 

(v) The Minimal Event-Free Time {MEET): {1) = 


3.5 /X-fold Cross-Validation of P-Values 

The log-rank test statistic (e.g. Xi a two group comparison) is a classical measure to evaluate the 
statistical significance of separation between survival curves. However, the null distribution of the log-rank 
test is not valid for cross-validated curves because the observations used to cross-validate the curves are 
not independent anymore. 

For each step I e {1,..., we generate the null distribution of the cross-validated log-rank statistic 

X'^’'(“)(0 ® e H} by randomly permuting the correspondence of survival times and censoring 

indicators of the data and by computing the corresponding cross-validated survival curves and cross- 
validated log-rank statistic for that permutation. By repeating A times the entire iL-fold cross-validation 
process for many random permutations (typically A = 1000), one generates a null distribution of the 
permuted log-rank statistics (annotated below w.l.o.g. for the case of “Combined CV”): 

The proportion of replicates with log-rank statistic greater than or equal to the observed statistic 
for the un-permuted data is the statistical significance level for the test. Log-rank test permutation p-values 
are then calculated for each step I e {1 ,..., as: 

#“(0 = ^ 1 ;/ 

a=l 

These p-values may be discrete: the precision depends on the number A of random permutations and 
the lower bound 1/A may be reached in practise. 


> ni) 


(17) 


3.6 Replicated iC-fold Cross-Validation 

Typically, iX-fold cross-validation is repeated H = 10 — 100 times and resulting replicated cross-validated 
estimates are somehow “averaged” over the replicates. We denote these by the superscript rev and each 
replicate by the superscript cv(b), for b 6 This is done for either cross-validation technique 

(shown below w.l.o.g. for the case of “Combined CV”). 

Formally, one first derives the “Replicated CV” maximal peeling length of the peeling model from the 
cross-validation replicates, denoted To do so, one uses the cross-validated maximum peeling length 
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of the peeling trajectory, defined in equation [TH for 6 6 {1,, B}. Formally, the “Replicated CV” 


maximal peeling length is calculated as the ceiling-mean of the cross-validated quantities L- 

, B 


cv(h) ^ 
m 


jrcv _ 


6=1 


cvib) 


(18) 


Next, depending on the optimization criterion used, one gets the “Replicated CV” optimal length of 
the peeling trajectory: 

Jrcv ^ argmax [A^™(^)] or = argmax or = argmin (19) 

where each optimization criterion: (i) The Log Hazard Ratio (LHR) in the high-risk box: (ii) 

The Log-Rank Test (LRT) between the high vs. low-risk box: and (iii) The Concordance Error 

Rate (CER) between high-risk box predicted and observed survival times: is taken as the average 

estimate over the B replicates for each step I e {1,..., as follows: 


-i S -t B -t B 

-yrcv^l~^ = _ ^ ^ QTCV ^ ^ 

^ 6=1 ^ 6=1 ^ 6=1 


( 20 ) 


Using the above “Replicated CV” optimal length of the peeling trajectory, one finally derives “Repli¬ 
cated CV” box end points from the B replicates for each step Z e {1,..., as follows: 

• The “Replicated CV” box definition (2| J| edges) is taken as the average-box over the B replicates: 

( 21 ) 


R’'"’'(Z) = ave 

6s{l,..,B} 


where ave(-) denotes the averaging function by edge or dimension j for j e J. 


The “Replicated CV” box membership indicator (Boolean n-vector) is taken as the average-box mem¬ 
bership indicator, observed to be nearly equal to the point-wise majority vote over the B replicates: 


= [/[x. e 
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( 22 ) 


2 = 1 


The ‘“Replicated CV” box support is taken as the average estimate over the B replicates: 

1 ^ 


(23) 


6=1 


Other “Replicated CV” box end-point quantities q estimates, taken as the average estimate over the 
B replicates: 


1 ^ 

^6=1 


(24) 


This is done for: (i) The Minimal Event-Free Probability (MEFP): and (ii) The Minimal 

Event-Free Time (MEET): Z^'’“(Z). 


21 










4 Numerical Analyses 


4.1 Simulation Design 

The p-dimensional covariates Xj = for i e n}, were identically and independently 

drawn from either: a (i) p-multivariate normal distribution with mean vector and variance-covariance 
matrix .17: Xj ~ Np{fx,U)] or (ii) from a p-multivariate uniform distribution on the interval [a, 6 ]: Xj ~ 
Up{a,b). 

Simulations were carried out according to the assumptions stated in section [ 2 . 2.2 1 Simulated realizations 
of true survival times Tj were identically and independently drawn from an exponential distribution with 
rate parameter A (and mean ^): Tj ~ Exp (A). Simulated realizations Ci of true censoring times were 
identically and independently sampled from a uniform distribution: Ci ~ 17(0, u) with u > 0, so that 
approximately 100 x 7 r(%) of the simulated realizations of observed survival times Yi = min(Tj,C'i) were 
censored, where vr e {0.3,0.5,0.7}. Finally, the simulated realizations of observed event (non-censoring) 
random variable indicator were 5i = I{Ti ^ Ci). 

To simulate survival models with various types of relationship between survival times (or hazards) 
and covariates (i.e. variable informativeness) including saturated regression models and noise models, the 
individual hazard rate parameter Aj was simulated as an exponential regression function of individual 
covariate x^: Ai(t|xj) = Ao(t) exp [//(xj)], where the regression function r/(xj) can take different forms 
depending on the simulated survival model. In summary, our simulation was done using the following 
parameters (see documentation in our R package for more details 0 ): 

• Xj ~ Up{0, 1) with n = 250 and p = 3, or Xj ~ 7Vp([0 0 0]^, a‘^1) with n = 100 and p = 1000. 

• by characterization of the first coverage box Bi (i.e. for m = 1 ), using constrained/directed peeling, 
without pasting and with meta-parameter values (aoj/^o) ^ {(0.10,0.05)}. 

• with censoring rate tt = 0.5 and five concurrent simulated survival models, where n > p and n « p, 
representing low- and high-dimensional situations, and where the regression function is as follows: 


— In simulated models ^1-4, ? 7 (xj) = rj^Xj with regression parameters rj = [r/i... Oj ... rjp]'^, for 
j e 0 u {l,...,p}: 


Low-dim. Saturated Model #1: 
Low-dim. Un-saturated Model #2: 
Low-dim. Noise Model #3: 
High-dim. Un-saturated Model ^4: 


n = 250,p = 3 
n = 250,p = 3 
n = 250,p = 3 
n = 100 , p = 1000 


77 = [12 - 15 - 5]^ 

77 = [12 - 15 O]"^ 

= [0 0 0 ]"^ 

77 = [771 ... 77100 0 ... 0]'^ 


In simulated model a Low-dim. Saturated Model with n = 250 and p 
* 77 ^Xj with regression parameters 77 = [12 —15 — 5]^ 

Ui ~ 17(0,1) 


Vi^i) = 


= 3, where 
for Xj e i? 
for Xi ^ R 


where R = [0.7,1] x [0,0.2] x [0,0.4] is an arbitrary box in 

• using K = 5-fold cross-validation, A = 1024 for the permutation 75 -values and B = 128 independent 
replications. 


4.2 Summary of Outputs 

We explain below how the main diagnostic and descriptive output plots are used. 
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4.2.1 Cross-Validated Tuning Profiles 

Cross-validated tuning profiles plot values of the box end-points statistics (section I2.2.6P Log Hazard Ratio 
(LHR), Log-Rank Test (LRT) or Concordance Error Rate (CER), depending on the optimization criterion 
chosen, as a function of peeling length or peeling steps of the peeling trajectory (model complexity). A 
peeling step includes step 7)^0 corresponding to the situation where the starting box covers the entire 
test-set data Ck before peeling (Algorithm [1]) . These statistics are used internally or interactively to get 
the “Replicated CV” optimal length of the peeling trajectory: Z''™ fsection [3.61) . In order to successfully 
determine the profiles minimizer or maximizer (section 13.2.31) . the cross-validated tuning profile should be 
approximately non-monotone up to sampling variability (Figure U]). In addition, one expects an inflation 
of variance of cross-validated point estimates towards the right-end of the cross-validated tuning profile 
corresponding to an increase in overfitting and model uncertainty for more complex models (Figure [U 
Supporting_Figures [H O [3]). 

The choice of the optimization criteria for controlling the peeling length is crucial. Two typical situa¬ 
tions of failure of cross-validation can happen from the cross-validated tuning profiles of the box end-point 
statistics: either an extremum cannot be reached in the profile before the peeling sequence runs out of 
data, or the prohle is essentially flat due to noise or the absence of any effect in the data (Figured]). In 
the former case, this results in excessive peeling steps and cross-validated values of optimal peeling lengths 
(eq. [T^ at, or near, the maximal peeling lengths (eq. flH]) . In the latter case, this results in 
un-reliable optimal peeling lengths Z'’™ that can take any value between the [l,Z(j(;’^] boundaries. In both 
cases, this leads to likely under-fitted or over-fitted models (see asterisk-annotated models in Table dj)- 


Typical Successful 


Typical Failed 


Typical Failed 



Peeling Steps 




Figure 4: Illustrations of typical successful (left) and failed (center and right) cross-validated tuning profiles of box end-point 
statistics. Left: Successful peeling stops with a “Replicated CV” optimal peeling length (see eq. m reached within 
the [1, Zm’'] boundaries of possible peeling lengths; Center: Failure to reach a maximum before running out of data; Right: 
Failure to reach a reliable maximum because of a Hat proHle. The “Replicated CV” optimal peeling length of the peeling 
trajectory is shown in each plot with the vertical black dashed line. Each colored proHle corresponds to one of the replications 
(B = 128). The cross-validated mean prohle of the statistic used in the optimization criterion is shown by the dotted black 
line with standard error of the sample mean. 


4.2.2 Peeling Trajectories 

Cross-validated peeling trajectories are estimated by step functions of the covariates box cuts as a function 
of box support/mass (Figures O [7|). They are read from right to left as they track the top-down peeling of 
the box induction process (peeling loop) of our “Patient Recursive Survival Peeling” method (Algorithm 
dj). These peeling trajectories are, up to sampling variability: 

• Monotone (increasing or decreasing) functions for each input covariate Xj, for j e {1,... ,p}. 

• Non-monotone (increasing then decreasing) functions for LHR A'’™(Z). 

• Non-monotone (increasing then decreasing) functions for LRT 

• Non-monotone (decreasing then increasing) functions of CER 0”'^^(Z). 
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• Monotone decreasing functions for MEFP {1). 

• Monotone decreasing functions for MEET (1). 


4.2.3 Trace Curves 

Cross-validated trace curves of covariate importance and covariate usage are estimated by piece-wise linear 
and step functions, respectively, as a function of box support/mass (Figures [H [8|). Similarly to peeling 
trajectories, they are read from right to left. Trace curves of covariate importance show on a single plot: 
(i) the amplitude of used covariates, (ii) the order (prioritization) with which these covariates are used, 
and (iii) the extent of the number of peeling steps by which each covariate is used. Covariate traces are 
reminiscent of the concept of variable selection from the fields of decision tree and regularization, that is: 


In “Variable Importance”, a prediction-based statistics borrowed from the existing theory of decision 
trees and their ensemble version [^. 

In “Selective Shrinkage” of variable coefficients /parameters from the existing theory of regularization 
and variable selection (e.g. LARS 25], Lasso [3], Elastic Net 74] and Spike & Slab [4l|). 


4.2.4 Survival Curves 

Each subplot of Figure (U [H] and [13] corresponds to a peeling step of our Patient Recursive Survival Peeling 
method for a tested model and cross-validation technique (including none). Each subplot shows cross- 
validated Kaplan-Meir estimates of the survival functions, as a function of survival time, of both “in-box” 
(red) and “out-of-box” (black) samples, corresponding respectively to the high-risk and low-risk groups. 
Each subplot also displays the corresponding step number along with cross-validated Log Hazard Ratio 
(LHR), Log-Rank Test (LRT) and log-rank permutation p-value p™{l) of survival distribution separation 
(see section [33|). A single survival curve always exists at Step corresponding to the situation where 
the starting box covers the entire test-set data Ck before peeling (Algorithm d]) . As the peeling progresses, 
the survival curves of “in-box” and “out-of-box” samples further separate until the peeling stops. 


4.3 Effect of Model Tuning 

4.3.1 Effect of the Optimization Criteria 

We first compare the effect of the three optimization criteria (Log Hazard Ratio LHR, Log-Rank Test 
LRT or Concordance Error Rate CER - section 13.2.31) used for model tuning and selection. Evaluations 
are reported on (i) the “Replicated CV” optimal peeling lengths Z”™ (eq. obtained from the cross- 
validated tuning profiles of the box end-point statistics (Log Hazard Ratio LHR, Log-Rank Test LRT or 
Concordance Error Rate CER), and (ii) on the cross-validated numbers of used covariates by the PRSP 
algorithm (see Algorithm [1]) out of the total number of pre-selected ones. Results are presented in Table 
m Supporting_Table [1] and Supporting_Pigures [H [2l [3] for the three peeling criteria used (Log Hazard 
Ratio LHR, Log-Rank Test LRT or Cumulative Hazard Summary CHS - section I2.2.3P as well as for 
our two cross-validation techniques (“Replicated Averaged CV” (RACV) vs. “Replicated Combined CV” 
(RCCV)), whether in low- or high-dimensional simulated survival regression models ^1, ^2, ^3 and ^4. 

From Table [T] and Supporting_Figures [H [21 [3l it results that both Log-Rank Test [LRT) and Con¬ 
cordance Error Rate [CER) optimization criteria give satisfactory results in low-dimensional simulated 
models (#1 and ^2), other than the simulated noise model (#3), where the peeling length is optimally 
pruned [L'^™ = 9 — 20). This is in sharp contrast to the Log Hazard Ratio [LHR) optimization criterion 
that fails to control the peeling length, regardless of the cross-validation technique or the peeling criterion 
used (L”™ = 26- 27). 

The situation differs in high-dimensional simulated models: the Concordance Error Rate (CER) appears 
to be the only optimization criterion that reliably controls the peeling length in simulated model #4, 
regardless of the peeling criterion or cross-validation technique used (Table [U Supporting_Figures[Il [2] [3|). 

Finally, note that the Concordance Error Rate CER) optimization criterion tends to yield slightly more 
conservative results than the Log-Rank Test (LRT) in terms of peeling length (Table[ll Supporting_Eigures 
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Table 1: Effect of peeling and optimization criteria as well as cross-validation techniques on the cross-validated tuning profiles 
of the box end-point statistics (Log Hazard Ratio LHR, Log-Rank Test LRT or Concordance Error Rate CER) and the resulting 
“Replicated CV” optimal peeling length L^““ (see eq. \19\) . Cross-validated optimal peeling lengths are reported for the 
combined effects of: (i) the three peeling criteria (by rows: Log Hazard Ratio (LHR), Log-Rank Test (LRT) or Cumulative 
Hazard Summary (CHS)), (ii) the three optimization criteria (by columns: Log Hazard Ratio (LHR), Log-Rank Test (LRT) or 
Concordance Error Rate (CER)), (Hi) the two cross-validation techniques (by columns: “Replicated Averaged CV” or RACV 
and “Replicated Combined CV” or RCCV), and (iv) the four tested simulation models (by rows: Model #1, #2, #3 or #4). 
Asterisks denote situations where reaches either a (quasi-)minimal or (quasi- )maximal value of optimal peeling lengths, 
corresponding to failed cross-validations / likely under-fitted or over-fitted models. 


Model #1 




Optimization Criterion 



LHR 

LRT 


CER 

RACV 

RCCV 

RACV RCCV RACV 

RCCV 

Peeling 

LHR 

26* 

25* 

20 

20 

10 

11 

LRT 

26* 

25* 

17 

20 

10 

10 

Criterion 

CHS 

26* 

26* 

14 

14 

09 

09 


Model #2 _ 

Optimization Criterion 


LHR 

LRT 

CER 

RACV RCCV 

RACV RCCV 

RACV RCCV 


Peeling 

Criterion 

LHR 

26* 

25* 

17 

17 

12 

12 

LRT 

26* 

24* 

10 

11 

10 

10 

CHS 

26* 

26* 

14 

14 

10 

10 


Model #3 




Optimization Criterion 




LHR 


LRT 


CER 



RACV RCCV 

RACV RCCV RACV RCCV 

Peeling 

Criterion 

LHR 

23* 

01 

to 

CO 

* 

01 

05 02 

LRT 

26* 

02 

26* 

02 

03 02 

CHS 

24* 

01 

23* 

02 

O 

o 

to 


Model #4 




Optimization Criterion 




LHR 

LRT 

CER 



RACV 

RCCV 

RACV 

RCCV RACV RCCV 

Peeling 

Criterion 

LHR 

17* 

09 

16* 

06* 

05 05 

LRT 

16* 

01* 

16* 

08* 

06 08 

CHS 

16* 

01* 

16* 

08* 

05 05 


[MED and number of used covariates (Supporting_Table[T]). Also, note that the Concordance Error Rate 
CER) has systematically less variance than the other Log-Rank Test {LRT) and Log Hazard Ratio (LHR) 
optimization criteria (Tabled! Supporting_Figures dl El E]). 

For the above reasons, we recommend using the Concordance Error Rate CER as optimization criterion 
in every situation or the Log-Rank Test LRT in low-dimensional situation only. 

4.3.2 Effect of the Peeling Criteria 

Overall, in all simulation models tested, the effect of the peeling criteria, used for model htting of the 
survival bump hunting model, is relatively marginal compared to that of the optimization criterion and/or 
cross-validation technique used for model tuning. 

However, for any combination of the optimization criterion and cross-validation technique used, both 
Log-Rank Test LRT and Cumulative Hazard Summary CHS peeling criteria tend to induce slightly shorter 
prohles and use less covariates than the Log Hazard Ratio LHR criterion (Table dl Supporting_Table dl 
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Supporting_Figures dl [21 [3]) . Moreover, the Cumulative Hazard Summary CHS peeling criterion tends to 
be slightly more conservative than the Log-Rank Test LRT (and the Log Hazard Ratio LHR) in terms of 
peeling length and number of used covariates, especially in high-dimensional models (Table [H Support- 
ing_Table[Il Supporting_Figures [Tl[2l[3|). 

So, we recommend using primarily the Cumulative Hazard Summary CHS and secondly the Log-Rank 
Test LRT as peeling criterion (in low or high-dimensional data) to reduce the risk of over-fitting (or 
conversely the Log Hazard Ratio LHR to avoid excessive conservativeness). 

4.3.3 Effect of the Cross-Validation Technique 

The difference of results between cross-validation techniques (including none) is striking in simulated 
noise model #3. Here, only the “Replicated Combined CV” (RCCV) cross-validation technique gives 
satisfactory results, regardless of the optimization or peeling criterion used. As expected in this situation, 
the model is extensively pruned = 1 — 2) using RCCV. The same consistency is not observed for 

“Replicated Averaged CV” (RACV) that fails to properly control the peeling length when the other Log- 
Rank Test {LRT) or Log Hazard Ratio {LHR) optimization criteria are used (Z”™ ^ L^^) (Table [H 
Supporting_Figures [U EJ [3]). Differences between cross-validation techniques are also significant in high¬ 
dimensional simulated models (Table [H Supporting_Figures [Tl[2l[3|). 

Using from now on a given peeling and optimization criterion in low-dimensional simulated models, 
we compare the performance of our two cross-validation techniques with each other (RACV vs. RCCV) 
and with the situation of no cross-validation (NOCV). Peeling trajectory and covariate usage/importance 
results are shown for model ^(^2 where it is possible to specifically assess effects of cross-validation on a 
noise/random covariate (X 3 ). Figures [5] and [ 6 ] show peeling trajectory profiles and covariate traces for 
model #2. Table [2] gives the corresponding rules. 


Xl covarlate trajectory 




Box mass 


MEFT trajectory 



I NOCV 

-RACV 

I— J -RCCV 

0.0 0.2 0.4 0.6 0.8 1.0 

Box mass 



Box mass 


MEFP trajectory 



LHR trajectory LRT trajectory CER trajectory 



Figure 5: Comparison of cross-validated peeling trajectories between situations when either cross-validation technique “Repli¬ 
cated Combined CV” (RCCV) or “Replicated Averaged CV” (RACV) and no cross-validation (NOCV) was done. Results are 
for simulated model #2 and the LRT statistic used in both peeling and optimization criteria. Compare the trajectory lengths 
between either cross-validation technique and in the absence of either one. Notice also the flat trajectory profile of covariate 
X 3 in the presence of either cross-validation technique (RACV or RCCV) as opposed to the situation where no cross-validation 
(NOCV) was done. 
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Clearly, both cross-validation techniques are effective in terms of (i) smoothing peeling trajectories out 
and (ii) pruning peeling trajectories off. Compare for instance results of simulation model ^ 2 : = 26 

without cross-validation (NOCV), = 10 with RACV and = 11 with RCCV (Figures [5] and O 
Table [2]). In fact, all cross-validated trajectory profiles in Figure [5] and covariate traces in Figure [ 6 ] stop at 
^TCV^I _ < 0.30 for RCCV and = 10) < 0.34 for RACV as compared to = 27) < 0.05 in 

the absence of cross-validation (NOCV). 



Covariate Usage 
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Figure 6 : Comparison of cross-validated trace plots of covariate importance VI(l) (top) and covariate usage VU{1) (bottom) 
between situations when either cross-validation technique “Replicated Combined CV” (RCCV) or “Replicated Averaged CV” 
(RACV) and no cross-validation (NOCV) was done. Results are for simulated model and the LRT statistic used in both 
peeling and optimization criteria. Compare the trace lengths between either cross-validation technique and in the absence of 
either one. Notice also the Sat trace of covariate X 3 about 0 in the presence of either cross-validation technique (RACV or 
RCCV) as opposed to the situation where no cross-validation (NOCV) was done. 

Notice from the peeling trajectories and trace plots of simulation model ^2 (compared to ^1} how 
a noise/random covariate (X 3 ) is effectively eliminated from the model after using either cross-validation 
technique (RCCV or RACV), while it is not in the absence of cross-validation (NOCV). In fact, xa’s RCCV 
and RACV peeling trajectories are mostly flat, that is, X 3 is unused in the decision rule (blue dashed curves 
in Figure [5|). Consistently, X 3 ’s RCCV and RACV covariate importance trace plots are mostly flat (top of 
Figure [ 6 ]) and X 3 ’s RCCV and RACV covariate usage trace plots show that X 3 is not used at all (bottom 
of Figure El). Similar conclusions are drawn with respect to simulated model #3 (compared to #1). 

The non-monotone behavior of the LRT and the overly large LHR values obtained in the non-cross- 
validated (NOCV) results of simulated model #2 (A^'^^(Z = 26) = 11.08) clearly reflect over-fitting and 
sub-optimal models. This is evident when comparing to the much more conservative values obtained from 
the corresponding cross-validated peeling profiles: ^'^'^{1 = 11 ) = 3.90 for RCCV and A^™(/ = 10) = 3.76 
for RACV (Figure [5] and Table [2]). This non-monotone behavior of LRT peeling prohle is precisely what 
allows us to use it in the optimization criterion. We suggest that this could be due to a greater sensitivity 
of LRT to small sample sizes at deep peeling steps. 

To further compare cross-validation techniques, we generated empirical distributions of various cross- 
validated estimates of box decision rules and box survival end-points/prediction statistics Isection [2.2.61) 
for each technique and end-point as a function of peeling steps. Distributions were obtained by generating 
B = 128 Monte-Carlo simulated datasets according to simulated model #1, where the LRT statistic was 
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Table 2: Comparison of cross-validated decision rules (upper Table) and box end points statistics of interest (lower Table) 
between situations when either cross-validation technique “Replicated Combined CV” (RCCV) or “Replicated Averaged CV” 
(RACV) and no cross-validation (NOCV) was done. For conciseness, only the initial and final decision rules (L^“"th step) are 
shown. Values are sample mean estimates with corresponding standard errors in parenthesis (NA in the case of NOCV, where 
no replication was performed - see manual of R package PRIMsrc for details M)- Step #0 corresponds to the situation where 
the starting box covers the entire test-set data Ck before peeling. Results are for simulated model #2 and the LRT statistic 
used in both peeling and optimization criteria. Notice the non-usage of covariate X 3 in the presence of either cross-validation 
technique (RACV or RCCV) as opposed to the situation where no cross-validation (NOCV) was done. 



Step 1 

Xl 

X 2 

X3 


0 

XI ^ 0.00 ( 0 . 00 ) 

X 2 ^ 1.00 (0.00) 

X 3 1.00 ( 0 . 00 ) 

RCCV 

1 

Xl 2 : 0.03 (0.00) 

X 2 ^ 0.88 ( 0 . 01 ) 

X 3 ^ 1.00 ( 0 . 00 ) 


11 

Xl 2 = 0.40 (0.06) 

X 2 ^ 0.62 (0.04) 

X 3 ^ 0.97 (0.05) 


0 

Xl ^ 0.00 ( 0 . 00 ) 

X 2 C 1.00 ( 0 . 00 ) 

X 3 ^ 1.00 ( 0 . 00 ) 

RACV 

1 

Xl 2 = 0.00 ( 0 . 00 ) 

X 2 C 0.88 ( 0 . 00 ) 

X 3 ^ 1.00 ( 0 . 00 ) 


10 

Xl 2= 0.42 (0.03) 

X 2 ^ 0.61 ( 0 . 02 ) 

X 3 ^ 0.96 (0.03) 


0 

Xl ^ 0.00 (NA) 

X 2 ^ 1.00 (NA) 

X 3 ^ 1.00 (NA) 

NOCV 

1 

Xl 25 0.03 (NA) 

X 2 s: 0.89 (NA) 

X 3 ^ 1.00 (NA) 


26 

Xl > 0.64 (NA) 

X 2 s: 0.34 (NA) 

X 3 ^ 0.51 (NA) 



Step 1 

n{l) 


Tr\i) 

pr\i) 

A'^RZ) 




0 

250 

(0.00) 

1.00 

(0.00) 

3.00 

(0.00) 

0.37 

(0.00) 

0.00 

(0.00) 

0.00 (0.00) 

1.00 

(0.00) 

RCCV 

1 

225 

(0.00) 

0.90 

(0.00) 

3.00 

(0.00) 

0.31 

(0.01) 

0.18 

(0.02) 

23.86 (1.14) 

0.43 

(0.00) 


11 

75 

(2.50) 

0.30 

(0.01) 

1.79 

(0.71) 

0.03 

(0.02) 

3.90 

(0.46) 

214.61 (33.56) 

0.26 

(0.01) 


0 

250 

(0.00) 

1.00 

(0.00) 

2.86 

(0.04) 

0.38 

(0.02) 

0.00 

(0.00) 

0.00 (0.00) 

1.00 

(0.00) 

RACV 

1 

225 

(2.50) 

0.90 

(0.01) 

2.86 

(0.05) 

0.31 

(0.02) 

1.19 

(0.02) 

4.84 (0.27) 

0.43 

(0.01) 


10 

85 

(2.50) 

0.34 

(0.01) 

1.01 

(0.34) 

0.05 

(0.02) 

3.76 

(0.41) 

43.64 (6.01) 

0.25 

(0.02) 


0 

250 

(NA) 

1.00 

(NA) 

3.00 

(NA) 

0.37 

(NA) 

0.00 

(NA) 

0.00 (NA) 

1.00 

(NA) 

NOCV 

1 

225 

(NA) 

0.90 

(NA) 

3.00 

(NA) 

0.31 

(NA) 

1.19 

(NA) 

23.43 (NA) 

0.43 

(NA) 


26 

12 

(NA) 

0.05 

(NA) 

0.01 

(NA) 

0.00 

(NA) 

11.08 

(NA) 

131.73 (NA) 

0.44 

(NA) 


used in both peeling and optimization criteria. The replication design accounts for two folds of variability: 
the one due to random splitting by cross-validation and the one due to sampling from the simulated 
model. Then, Box Coefficient of Variation (BCV) of decision rules (as defined in [l^) and coefficient 
of variation of box survival end-points/prediction statistics were computed and plotted as a function of 
peeling steps. Here, cross-validated coefficient of variation profiles do not show a consistent advantage of 
one cross-validation technique over the other considering all end-point analyzed. This remains true for a 
range of realistic sample sizes n e {50,100, 200} (Supporting_Figure 0]). 

Overall, our two cross-validation techniques, although not equivalent in design, give similar results 
on most profiles for the sample size and simulation models tested, confirming that both techniques are 
appropriate to the task in most situations. However, “Replicated Averaged CV” (RACV) appears to 
be less conservative than “Replicated Combined CV” (RCCV), especially in high-dimensional settings, 
which could be a problem if ones cares about reducing the risk of over-fitting. Also, RACV failed to 
prune simulated model #3 (Table [Hand Supporting_Figures[Il[2l[3l). This indicates that our RCCV cross- 
validation technique is more robust in noisy situations, possibly because RCCV uses larger test-set samples 
of size n to make estimations than RACV, which uses test-set samples of size n* n/K fsection 13.2.211 . 
For these reasons, we recommend using RCCV preferably to RACV. 
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4.3.4 Comparison Between Simulated Survival Models 

In line with the above guidelines (sections 14.3.21 and I4.3.3P , we used the following criteria and techniques 
in our numerical analyses for fitting and tuning/selecting our survival bump hunting model: (i) The Log- 
Rank Test LRT both as peeling and optimization criterion in low-dimensional settings; (ii) The Cumulative 
Hazard Summary CHS as peeling criterion and the Concordance Error Rate CER as optimization criteria 
in high-dimensional settings; and our “Replicated Combined Cross-Validation” (RCCV) technique. We 
compared the performance of our Survival Bump Hunting procedure in terms of peeling trajectories (Figures 
ElEl Table El Supporting_Figures El El Supporting_Table ^ and survival distribution curves (Figure E]) 
between all our models. 

Notice the striking differences in cross-validated peeling trajectories (Figure [7j) and covariate traces 
(Figure El): (i) when all covariates (xi,X 2 ,X 3 ) are noise (model #3), or (ii) when one covariate only (X 3 ) 
is noise (model ^ 2 ) or (hi) when all are informative (model #1). As expected, all peeling trajectories 
related to model ^3 are much shorter than in the other models, indicating an abortive PRSP procedure 
with little or no covariate usage during the peeling process (Figure [7]) nor involvement in the decision rule 
(Table E])- Similarly, one expects little or no usage of covariate X 3 in models ^2 and ^ 3 , as seen in their 
covariate peeling trajectories (Figured Table E])- 


Xl covarlate trajectory 




0.6 0.8 1.0 


model #1 

- - • model #2 

— model #3 
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- - • model #2 

— model #3 
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0.6 0.8 1.0 
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Figure 7: Comparison of replicated combined cross-validated results for the peeling trajectories between simulated models 
#2 and #3 for the “Replicated Combined CV” (RCCV) technique and the LRT statistic used in both peeling and optimization 
criteria. Notice the usage of all covariates (xi,X 2 ,X 3 ) in model as opposed to the selective usage of covariates (xi,X 2 ) in 
model and the abortive usage of all covariates in noise model #3. 

Consistent observations can be made from the cross-validated covariate importance and covariate usage 
traces of model #3. In fact, X 3 ’s covariate importance trace stops after only = 2 steps with box mass 
prcv^jrcv\^ ^ g niodel # 3 , as compared to = 20 steps with ^ gjg niodel #1 and 

j^rcv ^ ££ steps with ^ g gg £qj. niodel #2 (TableEland FigureEl)- Also notice the limited usage 

of covariate X 3 in the covariate usage traces of models #2 and #3 as compared to #1 and the fact that all 
cross-validated peeling trajectories in model #1 extend further than in other models #2 and ^3 (Figure 
[71 El and Table E]) • This is consistent with our simulation design in that all covariates in regression model 


29 














































































7^1 additively contribute to the hazards and to the separation of survival distributions. 
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Figure 8: Comparison of replicated combined cross-validated trace plots of covariate importance VI(l) (top) and covariate 
usage VU(l) (bottom) between simulated models #1, #2 and #3 for the “Replicated Combined CV” (RCCV) technique and 
the LRT statistic used in both peeling and optimization criteria. Notice the usage of all covariates (xi, X 2 , xs) in model #1 as 
opposed to the selective usage of covariates (xi,X 2 ) in model i^2 and the abortive usage of all covariates in noise model #3. 


Finally, Figure [9] shows the cross-validated Kaplan-Meir survival probability curves of the highest-risk 
group vs. lower-risk group in all low- and high-dimensional simulated models with their corresponding log- 
rang permutation p-values of survival distribution curve separation. The separation is especially evident in 
results of models #1, #2 and #4 in contrast to the overlap seen in model ^3. The permutation p-values 
are: = 20) ^ 9.7e — 5, = 11) ^ 9.7e — 5 and p‘^^{l = 8) ^ 0.046 for models #1, #2 and #4 

respectively, and p‘^'"{l = 2) ^ 0.1080 for model #3 (Figure [9]). 

Overall, Figures [71 El El Table El Supporting_Figures EHH and Supporting_Table [2] collectively support 
that our “Replicated Combined CV” (RCCV) cross-validation technique, when used with an appropriate 
combination of LRT and CER statistics as peeling and/or optimization criteria, is efficient at fitting and 
tuning a survival bump hunting model, whether in low- or high-dimensional settings. Moreover, we show 
that our PRSP algorithm (Algorithm [T]) successfully selects/uses a subset of (or all) the covariates that 
are informative (i.e. that truly enter into the model) in the box decision rules. 
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Table 3: Comparison of cross-validated decision rules (upper Table) and box end points statistics of interest (lower Table) 
between simulated models #2 and #3 for the “Replicated Combined CV” (RCCV) technique and the LRT statistic used 
in both peeling and optimization criteria. For conciseness, only the initial and final decision rules (L'^“"th step) are shown. 
Step #0 corresponds to the situation where the starting box covers the entire test-set data Lk before peeling. Values are 
sample mean estimates with corresponding standard errors in parenthesis. Notice the usage of all covariates (xi,X2,X3) in 
model #1 as opposed to the selective usage of covariates (xi, X2) in model #2 and the abortive usage of all covariates in noise 
model #3. 


Step 1 

Xl 

X2 

X3 


0 

XI 5s 0.00 (0.00) 

X2 ^ 1.00 (0.00) 

X3 ^ 1.00 (0.00) 

model 

1 

Xl 5= 0.03 (0.00) 

X2 ^ 0.88 (0.00) 

X3 ^ 1.00 (0.00) 


20 

Xl 5= 0.51 (0.04) 

X2 ^ 0.44 (0.06) 

X3 ^ 0.63 (0.08) 


0 

Xl ^ 0.00 (0.00) 

X2 ^ 1.00 (0.00) 

X3 sS 1.00 (0.00) 

model #2 

1 

Xl 5= 0.03 (0.00) 

X2 ^ 0.88 (0.00) 

X3 ^ 1.00 (0.00) 


11 

Xl 5= 0.40 (0.06) 

X2 ^ 0.62 (0.04) 

X3 ^ 0.97 (0.05) 


0 

Xl sS 1.00 (0.00) 

X2 ^ 1.00 (0.00) 

X3 ^ 0.01 (0.00) 

model #3 

1 

Xl s: 1.00 (0.00) 

X2 ^ 0.99 (0.01) 

X3 ^ 0.01 (0.01) 


2 

Xl s: 1.00 (0.00) 

X2 ^ 0.96 (0.05) 

X3 5s 0.06 (0.02) 


Step 1 

n(l) 


rp/' 

-^0 


pf 

-MD 


X ”“”( l ) 

r""(0 

e ”“”( i ) 


0 

250 

(0.00) 

1.00 

(0.00) 

3.00 

(0.00) 

0.42 

(0.00) 

0.00 

(0.00) 

0.00 

(0.00) 

1.00 

(0.00) 

model #1 

1 

225 

(0.00) 

0.90 

(0.00) 

3.00 

(0.00) 

0.36 

(0.00) 

1.06 

(0.03) 

19.21 

(1.13) 

0.43 

(0.00) 


20 

25 

(2.50) 

0.10 

(0.01) 

0.06 

(0.05) 

0.00 

(0.00) 

8.54 

(1.29) 

194.39 

(30.24) 

0.39 

(0.01) 


0 

250 

(0.00) 

1.00 

(0.00) 

3.00 

(0.00) 

0.37 

(0.00) 

0.00 

(0.00) 

0.00 

(0.00) 

1.00 

(0.00) 

model #2 

1 

225 

(0.00) 

0.90 

(0.00) 

3.00 

(0.00) 

0.31 

(0.00) 

1.18 

(0.02) 

23.86 

(1.14) 

0.43 

(0.00) 


11 

75 

(2.50) 

0.30 

(0.01) 

1.79 

(0.71) 

0.03 

(0.71) 

3.90 

(0.46) 

214.61 

(33.56) 

0.26 

(0.01) 


0 

250 

(0.00) 

1.00 

(0.00) 

2.99 

(0.00) 

0.41 

(0.00) 

0.00 

(0.00) 

0.00 

(0.00) 

1.00 

(0.00) 

model #3 

1 

225 

(2.50) 

0.90 

(0.01) 

2.99 

(0.00) 

0.40 

(0.01) 

0.21 

(0.14) 

0.86 

(0.79) 

0.49 

(0.01) 


2 

202 

(5.00) 

0.81 

(0.02) 

2.99 

(0.00) 

0.38 

(0.01) 

0.33 

(0.11) 

2.90 

(1.75) 

0.47 

(0.01) 
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Figure 9: Comparison of cross-validated Kaplan-Meir survival probability curves of the high-risk (red curve “in-box”) and 
low-risk (black curve “out-of-box”) groups in simulated models #2, #3 and #4. Results are for the “Replicated Combined 
CV” (RCCV) technique and the CHS statistic used as peeling criterion and CER used as optimization criteria. Left column: 
model #1, middle column: model #2, right column: model #3. For conciseness, only the last peeling step of the peeling 
sequence is shown for each model. Cross-validated LRT, LHR and permutation p-values of “in-box” samples are shown at the 
bottom of the plot with the corresponding peeling step for each method. P-values (1) ^ 9.7e —5 correspond to 1/lOth of the 
precision limit (see section 1X51) . Notice how the survival curves of “in-box” and “out-of-box” samples separates in models #i, 
#2 and #4 in contrast to the overlapping situation in noise model #3 with the corresponding significant and nonsignificant 
log-rank permutation p-value (1) of survival distribution separation. 
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4.4 Comparisons Against Other Methods 

4.4.1 Design and Choice of Varions Non-Parametric Snrvival Models 

Next, we compared our Survival Bump Hunting (SBH) procedure by our PRSP algorithm ([1]) to other 
competitive non-parametric survival models or methods in terms of survival and prediction end-points 
statistics. In all our performance analyses below, we used LRT in both peeling and optimization criteria 
and RCCV as our cross-validation technique. Comparisons include (i) Survival Bump Hunting by our 
PRSP method (Algorithm [D and [l3|h (ii) Regression Survival Trees (RST) by recursive partitioning 

I, (iv) Cox 


[il,|n|,[il,|3i,|4i, (hi) Random Survival Forest (RSF) by ensemble tree-based method [4C 

Proportional Hazard Regression (CPHR) [l^, (v) Survival Supervised PCA (SSPCA) [^, (vi) Survival 
Supervised Clustering (SSC) [^. 

The simulated survival model was according to simulated model ^T^lb, that is, by generating a box¬ 
shaped region R of the input covariate space with higher hazards than a uniform background (see SH). 
For comparisons, B = 128 repeated Monte-Carlo simulated datasets were used to generate empirical 
sampling distributions of cross-validation estimates of box statistics, survival end-points and prediction 
performance metrics (section 12.2.61) and make points and confidence intervals inferences. This replication 
design accounts for random splitting and simulated model sampling variabilities. 

For each method, an internal cross-validation was carried out for model fitting/training that was done 
by optimizing a specific empirical objective function of Goodness of Fit or Prediction Error measure on 
the corresponding test-set, such as: (i) maximization of the Log-Rank Test statistic or minimization of a 
Concordance Error Rate (SBH), maximization of the Deviance Residuals statistic (RST), maximization of 
the Concordance Index (RSF), maximization of the Likelihood Ratio Statistic between the reduced vs. full 
model (SSPCA), maximization of the Concordance Index (SSC). Then, cross-validation was used again to 
make estimations and predictions on the combined test-sets as described before (section [3.2.2p . 

Whether the goal is to make estimations or predictions, one wants to classify samples into two sur¬ 
vival/risk groups. However, unlike Survival Bump Hunting that inherently generates “in-box” and “out-of- 
box” groups, all other methods do not necessarily give directly two survival/risk groups. For comparisons 
purposes, one needs to come up with a calibrated way across all other methods to output two groups only. 
One way, shown to work well empirically, is by using the median survival time threshold (0). 


4.4.2 Comparison of End-Point Estimates 

Specifically, the trained models generate cross-validated fits from which cross-validated estimates of highest- 
risk/group support and survival end-points statistics (described in section [2.2.6p are made using the left-out 
test-set Ck- We report the results for all methods in Figure [TO] below. The figure shows the highest- 
risk/group end-points distributions of RCCV estimates of support and survival end-points statistics com¬ 
puted over B = 128 repeated Monte-Carlo simulated models #lb for all competitive non-parametric 
survival models under study. 

In Figure [11] below, a Kaplan-Meier estimate of RCCV survival probability curve is shown for each 
group and competitive non-parametric survival models under study from one replicate out of B = 128. As 
expected, it shows the extremeness of the survival distribution of the highest-risk box/group found by SBH 
as compared to all other methods. The box sample sizes in the highest-risk box/group (out of n = 250 
samples) were as follows for each method (and that replicate): SBH: usbh = 39, RST: urst = 105, RSF: 
nRSF = 124, CPHR: ucphr = 124, SSPCA: nssPCA = 124, SSC: nssc = 170. 

Overall, results from Figures [TO] and [H] point out that the highest-risk box/group found by SBH is, 
as expected, smaller in size (support) and more extreme in terms of survival hazards (LHR) or risks, 
with consistent smaller event-free end-point times (MEFT) and probabilities (MEFP) than any other 
method/model under study. This was the goal. However, we did not expect the separation of the estimated 
survival distributions to be necessarily larger. In fact, the distributions of the log-rank test statistics (LRT) 
are not signihcantly different between most methods (except SSC). Interestingly, the Concordance Error 
Rates {CER) are slightly higher for SBH than most methods (except SSC). So, it’s possible that the task 
of finding extreme survival/risks subgroups comes with some trade-off between achieving high levels of 
extremeness and high accuracy of survival time prediction. 
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Figure 10: Distributions of RCCV estimates of highest-risk/group end-points, computed over B = 128 repeated Monte-Carlo 
simulated models #1 and for all competitive non-parametric survival models under study. Comparisons include (i) Survival 
Bump Hunting (SBH), (ii) Regression Survival Trees (RST), (Hi) Random Survival Forest (RSF), (iv) Cox Proportional Hazard 
Regression (CPHR), (v) Survival Supervised PCA (SSPCA), (vi) Survival Supervised Clustering (SSC). In parenthesis is shown 
the criterion used for peeling or partitioning as it applies. For each SBH boxplot, the pair of horizontal dotted lines delineates 
the approximate (95%) confidence interval of the median. Results are for the “Replicated Combined CV” (RCCV) technique 
and the LRT statistic used in the optimization criteria. 


4.4.3 Comparative Prediction Performance 

The simulated survival model we drew from was according to model as follows: samples contained 
within a box-shaped region R of the input covariate space had increased risk/hazards while samples outside 
of it had a uniform background risk. The survival times were generated as in section (14.Ill , using the 
exponential distribution with random uniform censoring. The regression function for samples within R 
was as in model ^1. 

Due to the rule-induction nature of our “Patient Recursive Survival Peeling” method (Algorithm [T|), the 
box decision rule can be used as the classification rule. The cross-validated classification error is estimated 
from the discrepancies between the true and predictive classifications of the independent observations. 
Specifically, for each loop k e K] of the cross-validation, we compute a cross-validated estimate 

of the error by matching the SBH test-set “in-box” prediction samples to the true ones in the high-risk 
box-shaped region R of simulated model #lb using the left-out test-set Ck- The final cross-validated 
estimate of the Miss-classification Error Rate (MER) is given by the average of the cross-validated errors 
from the K models {'lZk}kZi generated from each loop of the cross-validation. This is repeated B times to 
get variability estimates. Note that CER and MER, although related, are distinct: the MER evaluates 
the accuracy to predict that a sample will fall into (or outside) the highest-risk box/group found by a 
method/survival model, whereas the CER evaluates the accuracy to predict a sample survival time. 

Prediction performances are then assessed using the usual accuracy metrics and Receiver Operating 
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Figure 11: Kaplan-Meier plots of RCCV survival probability curves for all competitive non-parametric survival models under 
study. Plots are illustrative of one replication out of B = 128. Comparisons include (i) Survival Bump Hunting (SBH), (ii) 
Regression Survival Trees (RST), (Hi) Random Survival Forest (RSF), (iv) Cox Proportional Hazard Regression (CPHR), (v) 
Survival Supervised PCA (SSPCA), (vi) Survival Supervised Clustering (SSC). In parenthesis is shown the criterion used for 
peeling or partitioning as it applies. The “in-box” legends (red) corresponds to the highest-risk box/group. Cross-validated 
LRT, LHR of “in-box” samples are shown at the top of the plot for each method (and that replicate). Results are for the 
“Replicated Combined CV” (RCCV) technique and LRT statistic used in the optimization criteria. 


Characteristics (ROC) for each method and compared between them. For binary classes, a common metric 
for assessing the prediction performance is prediction accuracy through the use of True- and False-Positive 
Rates TPR and FPR, respectively, also known as Sensitivity and 1 — Specificity. By definition, the 
True- and False-Positive Rates are defined as: 


TPR = Sensitivity = 


TP 


TP+ FN 
FP 

FPR = 1 — Specificity = 


FP + TN 


where TP, FP, TN, FN stands for True-Positive, False-Positive, True-Negative and False-Negative, re¬ 
spectively. The performance of classification is naturally assessed by measuring the accuracy of prediction, 
whereas the performance of ranking is commonly measured by takin g ( AUC), the Area Under the Re¬ 
ceiver Operating Characteristics (ROC) Curve {TPR versus FPR) |44l |. An AUC = 1 corresponds to 
a perfect classifier, while an AUC = 0.5 corresponds to all possible performances of a random classifier. 
Finally, we also report Pearson’s contingency table test p-values (after continuity correction) of inde¬ 
pendence between the observed versus predicted counts. Table [4] reports the classification performance 
results of various survival models/methods in terms of contingency table test, area under the ROC curve 
and sensitivity/specificity. 
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Table 4: Empirical contingency table test p-values {P.val), Area Under the Curve [AUC), Sensitivity (1 — FPR) and 
Specificity (TPR) for each method. Comparisons include (i) Survival Bump Hunting (SBH), (ii) Regression Survival Trees 
(RST), (Hi) Random Survival Forest (RSF), (iv) Cox Proportional Hazard Regression (CPHR), (v) Survival Supervised PCA 
(SSPCA), (vi) Survival Supervised Clustering (SSC). Values are median estimates with standard errors of the sample mean 
in parenthesis. In parenthesis, next to the method, is also shown the criterion used for peeling or partitioning as it applies. 
Results are for the “Replicated Combined CV” (RCCV) technique and the LRT statistic used in the optimization criteria. 


Method 

SBH (LRT) 

RST (LRT) 

RSF (LRT) 

CPHR 

SSPCA 

SSC 

P.val 

1 — FPR (Specificity) 
TPR (Sensitivity) 

MIC 

0.000 (0.32) 
0.800 (0.11) 
1.000 (0.38) 
0.899 (0.24) 

0.909 (0.42) 
0.947 (0.04) 
0.125 (0.17) 
0.533 (0.07) 

0.065 (0.24) 
0.516 (0.01) 
1.000 (0.14) 
0.757 (0.07) 

0.037 (0.29) 
0.516 (0.01) 
1.000 (0.26) 
0.758 (0.14) 

0.037 (0.34) 
0.516 (0.01) 
1.000 (0.25) 
0.758 (0.13) 

0.519 (0.33) 
0.373 (0.05) 
0.833 (0.21) 
0.591 (0.11) 


Table|3]shows that SBH has a better prediction performance on all metrics than any other method/model 
under study. This directly results from its better trade-off of Specificity and Sensitivity. Overall, this 
reflects the above results on comparative end-point estimates: SBH reaches out a more specific and smaller 
group of samples that is more extreme in survival hazards or risks. Conversely, other methods tend to be 
more sensitive, but way too un-specific. Note that this applies to Regression Survival Trees (RST) as well. 

5 Real Data Analysis 

Finally, we applied our Survival Bump Hunting (SBH) procedure to a publicly available real clinical dataset 
from the Women’s Interagency HIV cohort Study (WIHS) [^. It involves competing risks “AIDS/Death 
(before HAART)” and “Treatment Initiation (HAART)” during HIV-1 Infection in women. Here, for 
simplification purposes, only the first of the two competing events (the time to AIDS/Death) was used 
in our analysis. The data consists of n = 485 complete observations on the following p = 4 covariates in 
addition to the censoring indicator and (censored) time-to-event variables (Table [5]). 


Table 5: Women’s Interagency HIV Study (WIHS). Clinical dataset used with covariates description. 


Covariate Description 

Range 

AIDS/Death Diagnosis Time 

T (years) 

Event indicator variable 

C e {AIDS/Dead = 1, Censored = 0} 

Patient age at time of FDA approval of first protease inhibitor 

Age (years) 

Injection Drug Users (IDU) history 

IDU e {No history = 0, History = 1} 

Patients race 

Race e {Other = 0, African-American = 1}. 

CD4 count 

CD4 e [0, -l-Qo]/100 cells//rl 


All results in the WIHS clinical dataset were achieved using the “Replicated Combined CV” (RCCV) 
technique and the LRT statistic as peeling and optimization criterion, using K = 5, A = 1024 and B = 128. 
We show in Figure [12] the cross-validated tuning profile of LRT as a function of the number of peeling 
steps. According to our optimization criterion, we determined that the resulting “Replicated Combined 
CV” optimal length of the peeling trajectory is L”™ = 5 (Figure [T2|). 

We show in Table[6|the cross-validated decision rules and highest-risk box/group statistics at each step. 
Note that the box sample size in the final highest-risk box/group is n{l = 5) = 262 out of a total sample 
size of n = 485. Here, the cross-validated trace of covariate usage is: CDA, Age, Age, Age (Table [6|). 

Finally, we show in Figure [l3| the cross-validated Kaplan-Meir survival probability curves of the highest- 
risk group vs. lower-risk group at each step with their corresponding permutation p-values of separation. 
Notice how the curve separation increases with the peeling steps. The permutation p-values at each step 
are respectively: = 0) = 1, p^'’{l = 1 — 5) ^ 9.7e — 5 (Figure [T3|). 
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Peeling Steps 


Figure 12: Cross-validated tuning profile of the WfHS clinical dataset. The “Replicated Combined CV” cross-validated 
optimal peeling length (1,'"“^ = 5 j is shown with the vertical black dotted line. Each colored profile corresponds to one of the 
replications (B = 128). The cross-validated mean profile of the LRT statistic is shown by the solid black line with standard 
error of the sample mean. 


Table 6: Cross-validated decision rules (top) and highest-risk box/group statistics (bottom) of the WIHS clinical dataset. 
Values are sample mean estimates with corresponding standard errors in parenthesis. The box sample size at each step is also 
shown. Step #0 corresponds to the situation where the starting box covers the entire test-set data Ck before peeling. 


Step I Age I DU Race 

0 Age ^ 19.00 (0.00) IDU 5= 0.00 (0.00) Race ^ 1.00 (0.00) 

1 Age ^ 19.00 (0.00) IDU 5 = 0.00 (0.00) Race ^ 1.00 (0.00) 

2 Age ^ 22.07 (1.94) IDU is 0.00 (0.00) Race ^ 1.00 (0.00) 

3 Age ^ 23.30 (2.65) IDU is 0.00 (0.00) Race s: 1.00 (0.00) 

4 Age ^ 28.79 (0.51) IDU is 0.00 (0.00) Race s: 1.00 (0.00) 

5 Age ^ 29.22 (0.53) IDU is 0.00 (0.00) Race ^ 1.00 (0.00) 


CD4 

(7114 ^ 19.33 (0.00) 
CDi 8.64 (0.35) 
CDA C 8.51 (0.35) 
CD4: ^ 8.46 (0.16) 
CDA ^ 7.66 (0.83) 
(7H4 ^ 6.79 (0.77) 


Step 1 

n(l) 


Tr\i) 

pro 




0 

485 

(0.00) 

1.00 

(0.00) 

10.8 

(0.00) 

0.17 

(0.00) 

0.00 

(0.00) 

0.00 

(0.00) 

1.00 

(0.00) 

1 

436 

(0.00) 

0.90 

(0.00) 

10.8 

(0.00) 

0.17 

(0.00) 

0.61 

(0.02) 

16.90 

(1.41) 

0.46 

(0.00) 

2 

398 

(4.85) 

0.82 

(0.01) 

10.8 

(0.00) 

0.16 

(0.01) 

0.54 

(0.05) 

18.69 

(3.68) 

0.45 

(0.01) 

3 

364 

(9.70) 

0.75 

(0.02) 

10.8 

(0.00) 

0.14 

(0.01) 

0.61 

(0.07) 

29.28 

(7.32) 

0.43 

(0.01) 

4 

325 

(19.40) 

0.67 

(0.04) 

10.8 

(0.00) 

0.13 

(0.01) 

0.61 

(0.08) 

32.17 

(7.74) 

0.42 

(0.01) 

5 

262 

(33.95) 

0.54 

(0.07) 

10.8 

(0.00) 

0.11 

(0.01) 

0.61 

(0.10) 

32.51 

(10.86) 

0.42 

(0.01) 


The question of this study was whether it is possible to achieve a stratification or prognostication of 
patients for AIDS and HAART by using e.g. the ‘Injection Drug Users‘ (IDU) history. Overall, SBH 
shows that it is possible to achieve a stratification and prognostication of patients that are more likely 
to be diagnosed or die of AIDS than others. The decision rule identifies a subgroup of n(Z = 5) = 262 
such patients that should be treated more aggressively than others, e.g. by putting them on HAART 
treatment sooner. In addition, SBH reveals that these patients are characterized by a lower CD4: count of 
CD4, ^ 6.79(±0.77)/100 cells/^1 with an Age ^ 29.22(+0.53). Moreover, SBH reveals that ‘Injection Drug 
Users‘ history {IDU) was actually not the most useful covariate at this stage of determination. 

6 Discussion and Conclusion 

To build a survival bump hunting model, fit by a recursive peeling procedure, we used two sets of criteria 
with different purposes: (i) the peeling criteria for model fitting by maximization of the rate of increase in 
LHR, LRT or CHS statistics (section I2.2.3p . and (ii) the optimization criterion used for model tuning by 
maximization of the LHR or LRT or by minimization of the CER (section I3.2.3ji . 
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Figure 13: Kaplan-Meier plots of RCCV survival probability curves of the WIHS clinical dataset. Each plot represents a 
step of the peeling sequence. Step #0 corresponds to the situation where the starting box covers the entire test-set data 
Ck before peeling. The “in-box” legends (red) corresponds to the highest-risk box/group. Cross-validated LRT, LHR and 
permutation p-values of “in-box” samples are shown at the bottom of the plot with the corresponding peeling step for each 
method. P-values p“”{l) ^ 9.7e — 5 correspond to 1/lOth of the precision limit (see section 1X51) . Notice the single survival 
curve at Step #0 before peeling and how the survival curves of “in-box” and “out-of-box” samples separates as the peeling 
progresses. 


Despite being a well established concept as a splitting criterion in regression survival trees (section 
I2.2.3p . one criticism about the LRT is that it tends to favor continuous variables and causes some uneven 
splits (end-cut preference). In this study, the main difference that we observed between the three peeling 
criteria used so far (section I4.3.2P is in the conservativeness of the number of peeling steps or peeling 
sequence length (model complexity), and to a lower extent, of the number of used covariates (model size), 
regardless of the dimensionality of the data. In summary, we recommended using as peeling criterion 
primarily the Cumulative Hazard Summary CHS and secondarily the Log-Rank Test LRT to induce more 
conservative estimates and reduce the risk of over-htting, especially in high-dimensional data. 

In contrast, we noted that the overall effect of the peeling criteria is relatively marginal compared to 
that of the optimization criterion. In this study, we showed that the choice of the optimization criteria 
is crucial and dependent on the dimensionality of the data. In summary, we recommended using the 
Concordance Error Rate CER as optimization criterion in every situation (low- and high-dimensional 
data), or alternatively the Log-Rank Test LRT in low-dimensional situation only (section I4.3.ip . 

Overall, both of our replicated cross-validation techniques, namely the “Replicated Combined CV” 
(RCCV) and “Replicated Averaged CV” (RACV) techniques, were found effective at controlling (at least 
in part) the overhtting and under-htting issues, conhrming that these techniques are appropriate to the task 
of survival bump hunting modeling by a recursive peeling procedure. However, we observed differences in 
the cross-validation techniques, which raised the questions whether RACV could lend to over-htting more 
than RCCV, especially in high-dimensional data, and wether RACV performance could degrade faster than 
RCCV in situations with small sample sizes (see section [3. 2. 2p . For these reasons, we recommended using 
RCCV preferably to RACV. 

It is known that the stepwise covariate selection/usage procedure in the peeling loop of Algorithm [1] 
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induces an inflation of variance estimates primarily because of the adaptive nature of the algorithm (each 
peeling step is conditional on the previous step). Using replicated cross-validation is therefore recommended 
to reduce the variability of cross-validation estimates in recursive peeling methods. 

Survival estimates and model performance accuracy can be improved by the resampling technique used 
and resulting bias-variance trade-off. For instance, using a “larger” number of folds in iF-fold CV in the 
presence of “small” number of events or samples is known to reduce bias, but also increase variance of esti¬ 
mates. Beside our cross-validation techniques, the Leave-one-Out Cross-Validation (LOOCV - Jackknife) 
or bootstrap-based resampling techniques are available, such as the ordinary bootstrap j^, the 0.632 / 
Out-of-Bootstrap Cross-Validation (0.632 OOBCV - [3) or its latest variant (0.632-1- OOBCV 27]). As 
per Efron, ” the ordinary bootstrap gives an estimate of error with low variability, but with a possibly large 
downward bias, particularly in highly over-fitted situations^’’ [2j]. Conversely, cross-validation estimators 


are nearly unbiased (slightly upward), but have generally unacceptable high variability. For this reason, we 
chose to use a cross-validation procedure that could be replicated. The LOOCV and 0.632 OOBCV could 
be used instead of iL-fold CV, but have larger variance estimators. An alternative may be the 0.632-1- 
OOBCV estimator, which combines lower variance with a correction for bias [ 2 ^ . 

Collectively, our peeling strategy, combined with our model tuning/selection method along with cross- 
validation techniques support the claim that optimal survival bump hunting modeling can be done. Further, 
results also indicate that our strategies help our “Patient Recursive Survival Peeling” (PRSP) algorithm 
m use the most informative covariates in the decision rule. This suggests the possibility of carrying out a 
joint internal variable selection with our PRSP procedure. 

Some interesting differences between decision-tree and decision-box models lie in the weaknesses and 
strengths of the estimated solutions and in their applications. Here are some: (i) Stratification: if multiple 
groups are of interest, an advantage of recursive partitioning is that they directly lead to multi-group 
stratihcations of the data, instead of just presenting a rule for a single high (low) vs. low (high) response 
group; (ii) Interpretability: binary decision trees lead to an intuitive hierarchical interpretation of groups 
that facilitates their interpretation unlike peeling methods that are not constrained to a tree structure; (hi) 
Patience vs. Greediness: recursive partitioning methods are however notoriously “greedy” (exponential 
decrease of the data as the space undergo partitioning based on typically binary split), but recursive 
peeling methods can be made “patient” at will (quantile-controlled decrease of the data), eventually helping 
recursive top-down peeling algorithms such as ours to better learn from the data. 
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SUPPORTING INFORMATION 
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Supporting_Figure 1: Comparison of cross-validated tuning profiles of box end-point statistics between cross-validation tech¬ 
niques (overlaid: “Replicated Averaged CV” or RACV (black) vs. “Replicated Combined CV” or RCCV (red)) and opti¬ 
mization criteria (by rows: Log Hazard Ratio (LHR), Log-Rank Test (LRT) or Concordance Error Rate CER) in a given 
simulation model (by columns: simulation models 41=1, #2, #3 or #4J. Results are for the Log Hazard Ratio (LHR) peeling 
criterion. The resulting “Replicated CV” optimal peeling length L”“” (see eq. \19\) of the peeling trajectory is shown in each 
case (vertical dashed lines). Each dotted line corresponds to a cross-validated mean profile of the statistic used in the opti¬ 
mization criterion with the corresponding standard error of the sample mean, both calculated over the replications (B = 128J. 
Notice the situations of cross-validation success or failure as described in section [4. 3. J I and Eigure\^ Also, notice the expected 
increase of variance of cross-validated point estimates towards the right-end of the profiles corresponding to an increase in 
model uncertainty and regions of risk of overBtting. 
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Supporting_Figure 2: Comparison of cross-validated tuning profiles of box end-point statistics between cross-validation tech¬ 
niques (overlaid: “Replicated Averaged CV” or RACV (black) vs. “Replicated Combined CV” or RCCV (red)) and opti¬ 
mization criteria (by rows: Log Hazard Ratio (LHR), Log-Rank Test (LRT) or Concordance Error Rate CER) in a given 
simulation model (by columns: simulation models #1, #2, #3 or Results are for the Log-Rank Test (LRT) peeling 

criterion. The resulting “Replicated CV” optimal peeling length L”“” (see eg. \19\) of the peeling trajectory is shown in each 
case (vertical dashed lines). Each dotted line corresponds to a cross-validated mean profile of the statistic used in the opti¬ 
mization criterion with the corresponding standard error of the sample mean, both calculated over the replications (B = 128). 
Notice the situations of cross-validation success or failure as described in section [4. 3. i I and Eigure\^ Also, notice the expected 
increase of variance of cross-validated point estimates towards the right-end of the profiles corresponding to an increase in 
model uncertainty and regions of risk of overhtting. 
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Supporting_Figure 3: Comparison of cross-validated tuning profiles of box end-point statistics between cross-validation tech¬ 
niques (overlaid: “Replicated Averaged CV” or RACV (black) vs. “Replicated Combined CV” or RCCV (red)) and opti¬ 
mization criteria (by rows: Log Hazard Ratio (LHR), Log-Rank Test (LRT) or Concordance Error Rate CER) in a given 
simulation model (by columns: simulation models #2, #3 or Results are for the Cumulative Hazard Summary 

(CHS) peeling criterion. The resulting “Replicated CV” optimal peeling length L”“” (see eq. \19\) of the peeling trajectory is 
shown in each case (vertical dashed lines). Each dotted line corresponds to a cross-validated mean profile of the statistic used 
in the optimization criterion with the corresponding standard error of the sample mean, both calculated over the replications 
(B = 128 Notice the situations of cross-validation success or failure as described in section [4.3. J I and Figure\^ Also, notice 
the expected increase of variance of cross-validated point estimates towards the right-end of the proEles corresponding to an 
increase in model uncertainty and regions of risk of overStting. 
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Supporting_Figure 4: ProSles of coefficient of variation of Box Coefficient of Variation (BCV), survival end-points and predic¬ 
tion performance metrics. Comparative coefficient of variation profiles are shown for situations with decreasing sample sizes 
n G {50,100, 200}. Results are for simulated model #1 and the LRT statistic used in both peeling and optimization criteria. 










































































Supporting_Table 1: Effect of peeling and optimization criteria as well as cross-validation techniques on the cross-validated 
numbers of used covariates by the PRSP algorithm (see AlgorithmUJ) out of the total number of pre-selected ones (brackets). 
Numbers are reported for the combined effects of: (i) peeling criteria (by rows: Log Hazard Ratio (LHR), Log-Rank Test 
(LRT) or Cumulative Hazard Summary (CHS)), (ii) optimization criteria (by columns: Log Hazard Ratio (LHR), Log-Rank 
Test (LRT) or Concordance Error Rate (CER)), (in) cross-validation techniques (by columns: “Replicated Averaged CV” or 
RACV and “Replicated Combined CV” or RCCV), and (iv) the four tested simulation models (by rows: Model #1, #3 

or # 4 ;. 
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3[ 3] 3[ 3] 
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3[3] 
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Supporting_Figure 5: Comparison of replicated combined cross-validated results for the peeling trajectories in simulated model 
#4 for the “Replicated Combined CV” (RCCV) technique and the Cumulative Hazard Summary CHS as peeling criterion and 
the Concordance Error Rate CER as optimization criteria. Notice that covariates (xg, X32, X34, X72, X79) were those effectively 
used by the PRSP algorithm (see Algorithmic out of p = 1000 total covariates and p = 100 informative ones (see simulation 
design inD . 
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Supporting_Figure 6: Comparison of replicated combined cross-validated trace plots of covariate importance VI(l) (top) 
and covariate usage VU{1) (bottom) in simulated model #4 for the “Replicated Combined CV” (RCCV) technique and the 
Cumulative Hazard Summary CHS as peeling criterion and the Concordance Error Rate CER as optimization criteria. Notice 
that covariates (xe,X 32 ,X 34 ,X 72 ,X 79 ) were those effectively used by the PRSP algorithm (see Algorithm[l^ out of p = 1000 
total covariates and p = 100 informative ones (see simulation design inT) . 


Supporting_Table 2: Comparison of cross-validated decision rules (upper Supporting Table) and box end points statistics 
of interest (lower Supporting Table) in simulated model #4 for the “Replicated Combined CV” (RCCV) technique and the 
Cumulative Hazard Summary CHS as peeling criterion and the Concordance Error Rate CER as optimization criteria. For 
conciseness, only the initial and final decision rules (L”“”th step) are shown. Step #0 corresponds to the situation where the 
starting box covers the entire test-set data Ck before peeling. Values are sample mean estimates with corresponding standard 
errors in parenthesis. Notice that covariates (xe,X 32 ,X 34 ,X 72 ,X 79 ) were those effectively used by the PRSP algorithm (see 
Algorithm m out of p = 1000 total covariates and p = 100 informative ones (see simulation desisn\4^). 
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X79 ^ -1.83 (0.40) 

Step 1 

n{l) 

Tr^ii) 



e”“”(l) 

0 

model #4 

100 (0.00) 1.00 
89 (2.00) 0.89 
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